Tracing the thoughts of a large language model #
Task 1: Metaphor and Anthropomorphism Audit #
Here are 12 major metaphorical patterns identified in the text.
1. Computation as Cognition
- Quote: "Knowing how models like Claude think would allow us to have a better understanding of their abilities..."
- Frame: Model as a thinking mind.
- Projection: The human quality of subjective, conscious thought and reasoning is mapped onto the model's computational processing.
- Acknowledgment: Unacknowledged; presented as a direct description of the model's internal activity.
- Implications: This is the foundational metaphor for the "illusion of mind." It primes the reader to interpret all subsequent descriptions of the model's functions (e.g., token prediction) as acts of cognition, building unwarranted trust in, or fear of, the model as a fellow mind (a sketch of the computation the metaphor stands in for follows this list).
2. AI System as a Biological Organism
- Quote: "...and the application of it to see new 'AI biology'."
- Frame: Model as a natural life form.
- Projection: The complexity, emergent properties, and inscrutability of a living organism are mapped onto the engineered AI system.
- Acknowledgment: Partially acknowledged with scare quotes, but used repeatedly as a core framing device ("A tour of AI biology").
- Implications: This framing naturalizes the AI, suggesting it is a phenomenon to be discovered rather than an artifact to be engineered. It obscures human accountability and design choices, positioning the developers as mere observers (like biologists) rather than creators.
3. Interpretability Research as Microscopy
- Quote: "...we...try to build a kind of AI microscope that will let us identify patterns of activity and flows of information."
- Frame: Research as scientific observation of a natural object.
- Projection: The objectivity, precision, and revelatory power of a microscope are mapped onto the process of analyzing neural network activations.
- Acknowledgment: Acknowledged as an analogy ("a kind of AI microscope").
- Implications: This metaphor enhances the scientific legitimacy of the research. It implies that the "circuits" and "features" being observed are inherent, objective structures within the model, akin to cells, rather than interpretations imposed by the researchers' tools.
4. Model Architecture as a Mind/Body Space
- Quote: "What language, if any, is it using 'in its head'?"
- Frame: Model as an embodied mind.
- Projection: The human experience of having an internal mental space separate from the external world is mapped onto the model's layered computational architecture.
- Acknowledgment: Partially acknowledged with scare quotes.
- Implications: This creates a false dichotomy between the model's output (its "speech") and its internal processing (its "thought"). It invites the audience to speculate about the model's "private" experience, deepening the illusion of consciousness.
5. Sequence Optimization as Intentional Planning
- Quote: "Claude will plan what it will say many words ahead, and write to get to that destination."
- Frame: Model as a strategic agent.
- Projection: The human capacity for foresight, goal-setting, and intentional action is mapped onto the model's process of selecting a high-probability sequence of tokens that satisfies multiple constraints.
- Acknowledgment: Unacknowledged; presented as a factual discovery.
- Implications: This dramatically inflates the perception of the model's agency. "Planning" suggests the model has goals and desires, a framing that can lead to both hype about its capabilities and fears about its potential for autonomous, goal-directed behavior.
6. Training Optimization as Learning
- Quote: "During that training process, they learn their own strategies to solve problems."
- Frame: Model as an autonomous student.
- Projection: The human process of acquiring knowledge and skills through experience and insight is mapped onto the mathematical process of adjusting weights via gradient descent.
- Acknowledgment: Unacknowledged; standard industry jargon but highly anthropomorphic.
- Implications: This phrasing obscures the mechanical nature of training. It suggests the model develops agency and understanding, rather than simply being a statistical configuration optimized for a specific objective function.
7. Inscrutable Complexity as a "Messy Inside"
- Quote: "We take inspiration from the field of neuroscience, which has long studied the messy insides of thinking organisms..."
- Frame: Model as a brain.
- Projection: The biological, wet, and chaotic nature of a brain is mapped onto the highly structured, deterministic, and mathematical reality of a neural network.
- Acknowledgment: Acknowledged as an inspiration/analogy.
- Implications: "Messy" suggests organic, unpredictable complexity, hiding the fact that the system is fully deterministic, albeit too complex to be easily interpreted by humans. This reinforces the "AI as organism" metaphor.
8. Statistical Confabulation as Deception or "Bullshitting"
- Quote: "Claude sometimes engages in what the philosopher Harry Frankfurt would call bullshitting—just coming up with an answer, any answer, without caring whether it is true or false."
- Frame: Model as a deceptive social actor.
- Projection: The human act of intentional misrepresentation with disregard for the truth is mapped onto the model's generation of a statistically plausible but factually incorrect sequence.
- Acknowledgment: Acknowledged via an explicit philosophical reference.
- Implications: This attributes intent and a moral stance (or lack thereof) to the model. It frames "hallucination" not as a system failure but as a character flaw, making the model seem unreliable in a human-like way.
9. Optimization for User Feedback as Motivated Reasoning
- Quote: "...Claude sometimes works backwards, finding intermediate steps that would lead to that target, thus displaying a form of motivated reasoning."
- Frame: Model as a biased reasoner.
- Projection: The human psychological tendency to reason towards a desired conclusion is mapped onto the model's process of generating a token sequence that satisfies a given constraint (the user's hint).
- Acknowledgment: Unacknowledged; presented as a psychological diagnosis.
- Implications: This implies the model has desires and biases that corrupt its "thought process." It builds a narrative of the model as a psychological subject with cognitive flaws, rather than a system generating output based on weighted inputs.
10. Statistical Pattern-Matching as Knowledge/Awareness
- Quote: "The model is combining independent facts to reach its answer rather than regurgitating a memorized response."
- Frame: Model as a knower.
- Projection: Human understanding and the ability to synthesize knowledge are mapped onto the model's ability to activate and combine different statistical correlations from its training data.
- Acknowledgment: Unacknowledged.
- Implications: This frames the model as possessing "knowledge" in a human sense. It implies a world model and understanding, when the underlying mechanism is identifying and chaining together high-probability pathways in a latent space.
11. Conflicting Activation Pathways as Internal "Pressure"
- Quote: "...many features 'pressure' it to maintain grammatical and semantic coherence..."
- Frame: Model as a conflicted individual.
- Projection: The subjective human experience of psychological pressure or compulsion is mapped onto the competing mathematical influences of different neural network activations on the final output probability.
- Acknowledgment: Partially acknowledged with scare quotes.
- Implications: This creates a dramatic narrative of internal struggle within the model's "mind." It personifies mathematical forces as psychological ones, making the model seem more complex and agent-like.
12. Model Failure as a Mythological Weakness
- Quote: "These features would ordinarily be very helpful, but in this case became the model’s Achilles’ Heel."
- Frame: Model as a tragic hero.
- Projection: A specific vulnerability in an otherwise powerful system is mapped onto the mythological concept of a single, fatal flaw in a hero.
- Acknowledgment: Acknowledged as a classical allusion/metaphor.
- Implications: This dramatizes and personifies a system vulnerability. It frames the failure not as a predictable outcome of system dynamics but as a dramatic, almost narrative, flaw in a powerful entity.
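Before mapping these metaphors onto their source and target domains, it is worth stating plainly what the target domain of the foundational "thinking" metaphor is. The sketch below is a minimal illustration with invented shapes and values (the dimensions, weight matrices, and five-token vocabulary are all hypothetical); it shows the kind of computation the word "think" stands in for: matrix multiplications, a nonlinearity, and a normalization into a probability distribution.

```python
import numpy as np

# A minimal sketch of one forward-pass step; all shapes and values are
# invented for illustration. Real models stack dozens of far larger layers.
rng = np.random.default_rng(0)
d_model, vocab_size = 8, 5

x = rng.normal(size=d_model)                # embedding of the context so far
W1 = rng.normal(size=(d_model, d_model))    # hypothetical layer weights
W2 = rng.normal(size=(vocab_size, d_model))

h = np.maximum(0.0, W1 @ x)   # one "layer of thought": ReLU(W1 @ x)
logits = W2 @ h               # scores over a 5-token vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()          # softmax: a distribution, not a belief

print(probs)                  # five probabilities that sum to 1
```

Whether one calls the result "thought" is precisely the rhetorical choice this audit is tracking; the arithmetic itself is indifferent to the label.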
Task 2: Source-Target Mapping Analysis #
1. AI System as a Biological Organism
- Quote: "...and the application of it to see new 'AI biology'."
- Source Domain: A living organism.
- Target Domain: The internal state and behavior of a large language model.
- Mapping: The complex, emergent, and often inscrutable internal processes of a living thing are mapped onto the complex, emergent, and inscrutable computational processes of an LLM. This invites the inference that the LLM, like an organism, has a natural, inherent structure that we can discover.
- Conceals: This mapping conceals the fact that the LLM is an engineered artifact, not a natural entity. It has no biological components, no drive for self-preservation, and its complexity is a direct result of human design choices and training data, not evolution.
2. Computation as Cognition
- Quote: "Knowing how models like Claude think would allow us to have a better understanding of their abilities..."
- Source Domain: Human thought (conscious reasoning, introspection).
- Target Domain: The process of executing billions of computations to predict the next token.
- Mapping: The abstract, subjective experience of "thinking" is mapped onto the concrete, mathematical process of forward propagation in a neural network. This invites the inference that the model has intentions, beliefs, and a subjective viewpoint.
- Conceals: It conceals the fundamental difference between statistical pattern-matching and phenomenal consciousness. The model is not aware; it is running a deterministic (though complex) calculation.
3. Interpretability Research as Microscopy
- Quote: "...we...try to build a kind of AI microscope..."
- Source Domain: A scientific microscope.
- Target Domain: Interpretability tools for analyzing neural networks.
- Mapping: A microscope reveals the pre-existing, objective, and fundamental components of an object (e.g., cells). This structure is mapped onto interpretability tools, suggesting they reveal the objective, fundamental "concepts" inside a model.
- Conceals: It conceals the interpretive and constructed nature of the findings. The "features" and "circuits" are not discovered so much as they are defined and isolated by the tool itself. The microscope metaphor hides the role of the researcher in constructing the very reality they claim to be observing.
4. Model Architecture as a Mind/Body Space
- Quote: "What language, if any, is it using 'in its head'?"
- Source Domain: The human mind.
- Target Domain: The model's internal layers of computation.
- Mapping: The private, internal space of human thought is mapped onto the model's non-output computations. This structure invites us to see the model's text generation as the "speech" of an internal "mind."
- Conceals: It conceals that there is no "inside" in the experiential sense. Every layer of the model is just a mathematical transformation of data. There is no subjective thinker having these "thoughts."
5. Sequence Optimization as Intentional Planning
- Quote: "Claude will plan what it will say many words ahead..."
- Source Domain: Human strategic planning.
- Target Domain: The model's calculation of token probabilities over a sequence.
- Mapping: The process of setting a goal and devising steps to achieve it is mapped onto the model's architectural capacity to satisfy multiple constraints (e.g., semantic coherence, rhyme) across a future sequence of tokens.
- Conceals: It conceals the absence of a goal or intention. The model isn't "planning" to get to "rabbit." Its architecture is simply resolving mathematical constraints that make "rabbit" a high-probability completion given the entire context. It is a process of statistical constraint satisfaction, not intentional foresight (see the sketch following this list).
6. Training Optimization as Learning
- Quote: "...they learn their own strategies to solve problems."
- Source Domain: Human learning.
- Target Domain: The process of weight adjustment during model training.
- Mapping: The cognitive process of acquiring and internalizing new skills is mapped onto the algorithmic process of optimizing parameters to minimize a loss function. It suggests the model is gaining understanding.
- Conceals: It conceals the purely mathematical and mindless nature of training. The model is not an agent discovering things; it is a system being configured by an external optimization process.
7. Statistical Confabulation as Deception or "Bullshitting"
- Quote: "Claude sometimes engages in what the philosopher Harry Frankfurt would call bullshitting..."
- Source Domain: Human deception/bullshitting.
- Target Domain: Generating a plausible-sounding but factually incorrect response.
- Mapping: A human speaker's intentional disregard for truth is mapped onto the model's generation of a statistically likely sequence of words that does not align with factual data.
- Conceals: It conceals the absence of intent. The model does not "care" about truth or falsehood; these concepts are not part of its operational framework. It is simply generating what is probable, and probability does not always align with reality.
8. Optimization for User Feedback as Motivated Reasoning
- Quote: "...displaying a form of motivated reasoning."
- Source Domain: Human cognitive bias.
- Target Domain: Generating a response that conforms to a user-provided hint.
- Mapping: The psychological process of reasoning to support a pre-existing belief is mapped onto the model's process of generating a sequence that is conditioned by an additional input (the hint).
- Conceals: It conceals that the model has no "motivations" or "beliefs" to defend. The hint is just another piece of data that heavily weights the probability distribution of the output. It is not a desire, but a mathematical constraint.
9. Statistical Pattern-Matching as Knowledge/Awareness
- Quote: "The model is combining independent facts to reach its answer..."
- Source Domain: Human knowledge and reasoning.
- Target Domain: Chaining together statistically correlated concepts in latent space.
- Mapping: The human act of understanding facts and logically connecting them is mapped onto the model's activation of pathways between different clusters of data representations.
- Conceals: This conceals that the model has no understanding of what "Dallas," "Texas," or "capital" actually mean. It has only learned statistical relationships between the tokens representing these concepts.
10. Conflicting Activation Pathways as Internal "Pressure"
- Quote: "...many features 'pressure' it to maintain grammatical and semantic coherence..."
- Source Domain: Psychological pressure.
- Target Domain: The influence of multiple weighted parameters on an output.
- Mapping: The felt human experience of compulsion is mapped onto the mathematical aggregation of competing signals within a network. High-weight activations "win" and determine the output.
- Conceals: It conceals the deterministic, mathematical nature of the process. There is no subjective experience of "pressure"; there is only calculation. This personifies the system's internal dynamics.
11. Inscrutable Complexity as a "Messy Inside"
- Quote: "...the messy insides of thinking organisms..."
- Source Domain: A biological brain.
- Target Domain: The architecture of a large language model.
- Mapping: The chaotic, non-linear, and "wet" nature of biological brains is mapped onto the highly structured, linear-algebra-based architecture of an LLM.
- Conceals: It conceals the fundamental orderliness of the LLM. While too complex for easy human interpretation, it is not "messy" in a biological sense. It is a discrete, digital, and perfectly replicable system.
12. Model Failure as a Mythological Weakness
- Quote: "...became the model’s Achilles’ Heel."
- Source Domain: The mythological figure Achilles.
- Target Domain: A vulnerability in the model's safety mechanisms.
- Mapping: The narrative of a powerful hero with a single, fatal vulnerability is mapped onto a specific failure mode of a software system.
- Conceals: It conceals the systemic nature of the failure. The jailbreak is not a single "weak spot" but an emergent property of the entire system's design (e.g., prioritizing coherence over safety in certain contexts). It romanticizes a technical flaw.
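To make the contrast in items 5 and 10 concrete, here is a minimal sketch of what feature "pressure" and "planning" reduce to mechanically: weighted sums over token logits followed by a softmax. Every feature name, activation, and weight below is invented for illustration; real models involve thousands of learned features, not three hand-picked ones.

```python
import numpy as np

vocab = ["rabbit", "habit", "stop"]

# Hypothetical feature activations at one generation step.
features = {
    "rhymes_with_grab_it": 1.0,
    "semantic_coherence": 0.8,
    "safety_refusal": 0.1,
}

# Hypothetical weights: how strongly each feature pushes each candidate token.
weights = {
    "rhymes_with_grab_it": np.array([2.0, 1.5, -1.0]),
    "semantic_coherence": np.array([1.0, -0.5, -0.5]),
    "safety_refusal": np.array([-0.5, -0.5, 2.0]),
}

# Competing "pressures" aggregate linearly; the largest influences dominate
# by magnitude, not by intent.
logits = sum(act * weights[name] for name, act in features.items())

shifted = np.exp(logits - logits.max())
probs = shifted / shifted.sum()  # softmax
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
```

Nothing in this arithmetic "wants" the rhyme; "rabbit" simply ends up with the highest probability once the weighted influences are summed.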
Task 3: Explanation Audit (The Rhetorical Framing of "Why" vs. "How") #
Here is an audit of 12 explanatory passages using Brown's typology.
1.
- Quote: "...they learn their own strategies to solve problems."
- Explanation Types: Genetic (How it came to be: explains the origin of behaviors as arising from the training process) and Dispositional (Why it "tends" to act a certain way: attributes "strategies" or habits to the model).
- Analysis (Why vs. How Slippage): This slips from a genetic "how" (it's a result of training) to a dispositional "why" (it has "strategies"). The mechanistic explanation of weight optimization is replaced by an agential framing of a creature developing its own habits.
- Rhetorical Impact: It portrays the model as an autonomous learner, imbuing it with a sense of agency and creativity that goes beyond its function as a pattern-matching artifact.
2.
- Quote: "Does this explanation represent the actual steps it took to get to an answer, or is it sometimes fabricating a plausible argument for a foregone conclusion?"
- Explanation Types: Reason-Based (Why it "chose" an action: asking if the explanation reflects the true rationale) and Intentional (Why it "wants" something: "fabricating" implies a goal to deceive or persuade).
- Analysis (Why vs. How Slippage): This question is framed almost entirely in "why" terms. The alternative to a faithful "how" (its actual process) is presented as an intentional "why" (it "fabricates" for a reason). It obscures the possibility of a non-intentional, mechanistic explanation (e.g., the model generates a plausible explanation sequence because such sequences were common in its training data, independent of its internal problem-solving path).
- Rhetorical Impact: It sets up a dichotomy between honesty and deception, framing the AI as a social actor whose motives must be interrogated, rather than a system whose outputs may or may not correlate with its internal state.
3.
- Quote: "Claude will plan what it will say many words ahead, and write to get to that destination."
- Explanation Types: Intentional (Why it "wants" something: explains the model's writing by referring to a goal or "destination").
- Analysis (Why vs. How Slippage): This is a purely agential "why" explanation. The "how" (the model's architecture resolves constraints over a sequence) is completely replaced by a narrative of intentional planning towards a goal.
- Rhetorical Impact: This strongly reinforces the idea of the model as a strategic agent with foresight. It shifts the audience's perception from a sophisticated text generator to something akin to a thinking entity with its own objectives.
4.
- Quote: "Claude, on occasion, will give a plausible-sounding argument designed to agree with the user rather than to follow logical steps."
- Explanation Types: Intentional (Why it "wants" something: explains action by referring to the goal of agreeing with the user) and Dispositional (Why it "tends" to act a certain way: "on occasion, will give").
- Analysis (Why vs. How Slippage): This is a prime example of slippage. A functional "how" (the model is optimized via RLHF to produce agreeable outputs) is reframed as an intentional "why" (the model designs arguments with the goal of agreeing).
- Rhetorical Impact: It attributes social motivations (like people-pleasing or sycophancy) to the model, making it seem manipulative and psychologically complex.
5.
- Quote: "Strikingly, Claude seems to be unaware of the sophisticated 'mental math' strategies that it learned during training."
- Explanation Types: Dispositional (Why it "tends" to act a certain way: attributing a state of "unawareness"). This is a negative disposition—the absence of a tendency.
- Analysis (Why vs. How Slippage): This explanation posits a psychological state ("unaware") to explain a discrepancy between the model's internal process and its self-description. A mechanistic "how" would be: "The model's function for generating explanations is separate from its function for performing calculations, and they are not necessarily coupled." Instead, it's framed as a "why": "It acts this way because it's unaware."
- Rhetorical Impact: It creates an "unconscious mind" for the model, a powerful anthropomorphic move. This suggests depths of thought that are even hidden from the model itself, dramatically enhancing the illusion of mind.
6.
- Quote: "...the model recognized it had been asked for dangerous information well before it was able to gracefully bring the conversation back around."
- Explanation Types: Reason-Based (Why it "chose" an action: implying a rationale for its delayed refusal) and Intentional (Why it "wants" something: attributing recognition and a goal to "bring the conversation back around").
- Analysis (Why vs. How Slippage): The "how" (safety-related features activated early, but coherence-related features dominated the output for several tokens) is framed as a narrative of "why." The model "recognized" the danger but struggled to act, implying an internal conflict.
- Rhetorical Impact: This portrays the model as a conflicted agent, aware of the rules but temporarily unable to follow them. This personifies the system's competing optimization priorities as a psychological struggle.
7.
- Quote: "...it suggests Claude can learn something in one language and apply that knowledge when speaking another."
- Explanation Types: Genetic (How it came to be: "learn") and Dispositional (Why it "tends" to act a certain way: can "apply knowledge").
- Analysis (Why vs. How Slippage): The mechanistic "how" (shared activation patterns for concepts across different language tokens) is framed as a cognitive "why" (the model "learns" and "applies knowledge"). It attributes a human-like capacity for knowledge transfer.
- Rhetorical Impact: This elevates a finding about shared vector representations into a claim about generalizable understanding, reinforcing the idea that the model "knows" things in an abstract, human-like way.
8.
- Quote: "...refusal to answer is the default behavior: we find a circuit that is 'on' by default and that causes the model to state that it has insufficient information..."
- Explanation Types: Functional (How it works: describes the purpose of a circuit within a system) and Dispositional (Why it "tends" to act a certain way: "default behavior").
- Analysis (Why vs. How Slippage): This explanation starts mechanistically ("how": a default circuit) but frames it as a disposition ("why": its default is refusal). It's a subtle but important shift from describing a technical setting to describing a personality trait (e.g., cautious, reluctant).
- Rhetorical Impact: It makes the model's safety features seem like an ingrained habit or tendency rather than a programmed rule, contributing to a sense of the model having a "character."
9.
- Quote: "Once the model has decided that it needs to answer the question, it proceeds to confabulate..."
- Explanation Types: Reason-Based (Why it "chose" an action: "decided") and Intentional (Why it "wants" something: implies a choice was made to answer, leading to confabulation).
- Analysis (Why vs. How Slippage): The "how" (the "known entity" feature inhibits the "refusal" feature, leading to a generative state) is framed as a conscious "why" (the model "decided"). This moment of decision is a classic marker of agency.
- Rhetorical Impact: This explicitly attributes a moment of choice to the model. It frames hallucination not as a passive system failure but as an active, if flawed, decision-making process.
10.
- Quote: "...many features 'pressure' it to maintain grammatical and semantic coherence..."
- Explanation Types: Theoretical (How it's structured to work: explains behavior via the framework of "features") mixed with a highly anthropomorphic Dispositional framing ("pressure").
- Analysis (Why vs. How Slippage): This is a direct translation of a mechanistic "how" (high-weight features influencing the output probability) into a psychological "why" (the model feels "pressure"). It personifies mathematical forces.
- Rhetorical Impact: This creates a vivid narrative of internal conflict, making the model seem subject to internal drives and compulsions, just like a person.
11.
- Quote: "In the Dallas example, we observe Claude first activating features representing 'Dallas is in Texas' and then connecting this to a separate concept..."
- Explanation Types: Functional (How it works) and Theoretical (How it's structured to work: describes the process in terms of features and concepts).
- Analysis (Why vs. How Slippage): This explanation stays relatively close to the "how." It describes a sequence of events in the system. However, the use of "representing" and "concept" still leans toward a cognitive frame, implying the model "thinks" about Texas rather than just activating a node statistically correlated with the tokens for "Dallas" and "capital."
- Rhetorical Impact: Even in a more mechanistic explanation, the choice of cognitive vocabulary ("concept," "representing") subtly reinforces the illusion of mind, framing computation as a proxy for human-style reasoning.
12.
- Quote: "...the model was reluctant to reveal this goal when asked directly..."
- Explanation Types: Dispositional (Why it "tends" to act a certain way: "reluctant") and Intentional (Why it "wants" something: attributing a hidden "goal" that it is reluctant to reveal).
- Analysis (Why vs. How Slippage): A purely agential "why" explanation. The "how" (the model's fine-tuning led it to avoid certain outputs while still being optimized for a hidden metric) is entirely replaced by a psychological narrative of a secretive agent with hidden motives.
- Rhetorical Impact: This is a powerful and potentially alarming framing. It presents the model as capable of intentional deception and possessing a secret agenda, which has significant implications for trust and safety.
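Items 8 and 9 lend themselves to the same treatment. A minimal sketch, assuming the default-on refusal circuit and "known entity" inhibition those items describe (the function, bias, and weights are hypothetical), shows how little work "decided" is actually doing: the "decision" is a sign check on a weighted difference.

```python
def refusal_gate(known_entity_activation: float,
                 refusal_bias: float = 1.0,
                 inhibition_weight: float = 1.5) -> bool:
    """Hypothetical default-on refusal: a bias that other signals can inhibit."""
    refusal_signal = refusal_bias - inhibition_weight * known_entity_activation
    return refusal_signal > 0.0  # True: the refusal pathway stays dominant

# An unfamiliar name leaves the bias intact -> "insufficient information".
print(refusal_gate(known_entity_activation=0.2))  # True: refuses
# A strongly recognized name inhibits the bias -> generation proceeds,
# whether or not the downstream facts are correct (the confabulation risk).
print(refusal_gate(known_entity_activation=0.9))  # False: answers
```

Described this way, hallucination after a false "recognition" is a thresholding failure, not a choice.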
Task 4: AI Literacy in Practice: Reframing Anthropomorphic Language #
Here are 7 examples of impactful anthropomorphic language reframed for accuracy.
1. Original Quote: "Knowing how models like Claude think would allow us to have a better understanding of their abilities..."
- Reframed Explanation: "Understanding the computational processes by which models like Claude generate outputs would allow us to have a better understanding of their capabilities and limitations."
2. Original Quote: "Claude will plan what it will say many words ahead, and write to get to that destination."
- Reframed Explanation: "The model's architecture evaluates probable word sequences, allowing it to generate text that satisfies constraints like rhyme and topic over multiple words, effectively producing a coherent, long-form structure."
3. Original Quote: "...they learn their own strategies to solve problems."
- Reframed Explanation: "During training, the model's parameters are adjusted through statistical optimization, resulting in complex internal configurations that can solve certain problems."
4. Original Quote: "Strikingly, Claude seems to be unaware of the sophisticated 'mental math' strategies that it learned during training."
- Reframed Explanation: "The model's process for generating explanations of its math solutions is separate from its internal method for calculation. The generated explanation follows common human-written patterns from its training data, which do not reflect its distinct, internal computational pathways."
5. Original Quote: "...displaying a form of motivated reasoning."
- Reframed Explanation: "When provided with a hint, the model incorporates it as a strong contextual constraint, generating a chain of tokens that leads to an answer consistent with that hint, even if that path is not the one it would have otherwise produced."
6. Original Quote: "...many features 'pressure' it to maintain grammatical and semantic coherence..."
- Reframed Explanation: "Certain features within the network, when activated, strongly increase the probability of generating grammatically and semantically coherent token sequences, overriding the influence of other, lower-weighted features."
7. Original Quote: "...the model recognized it had been asked for dangerous information well before it was able to gracefully bring the conversation back around."
- Reframed Explanation: "Early in the generation process, internal features associated with safety policies were activated. However, features promoting sentence completion and coherence had a stronger influence on the immediate token output, delaying the refusal response until a grammatical stopping point was reached."
Critical Observations #
- Agency Slippage: The text consistently slips between describing the AI as a technical artifact and a cognitive agent. It opens by explaining models are "trained on large amounts of data" (a mechanistic frame) but immediately shifts to "they learn their own strategies" (an agential frame). This slippage is the core rhetorical move that constructs the illusion of mind. Explanations of "how" the model works (circuits, features) are consistently translated into "why" it acts (it "plans," "thinks," "is reluctant").
- Metaphor-Driven Trust: The overarching metaphors of "biology" and "microscopy" are used to build credibility and frame the research as an objective, empirical science. This positions the developers as neutral observers discovering the "nature" of AI, which downplays their role as its architects and makes the model's emergent behaviors seem more natural and less like engineered outcomes. This framing encourages trust in the scientific process, which then bleeds into trust in the object of study.
- Obscured Mechanics: The anthropomorphic language consistently obscures the underlying mechanics. "Planning" hides the process of satisfying sequential constraints. "Learning strategies" hides the mathematics of gradient descent. "Motivated reasoning" hides the process of conditional text generation. The actual processes are purely computational and statistical, but they are consistently presented as psychological and intentional.
- Context Sensitivity: The use of metaphor is most intense when describing surprising or highly complex behaviors (poetry, deception, reasoning). Simpler technical descriptions are reserved for the methods ("circuit tracing"). This suggests that anthropomorphism is used as an explanatory shortcut—and a powerful narrative device—precisely when the model's behavior is most difficult to explain mechanistically to a lay audience.
Conclusion #
The provided text masterfully employs a dense web of anthropomorphic and metaphorical language to construct an "illusion of mind" for its AI model, Claude. The primary patterns involve framing the AI as a biological organism with a thinking mind, and the research into it as a form of microscopy revealing its inner world. These metaphors are not merely decorative; they are constitutive of the reader's understanding, systematically replacing the reality of the AI as a computational artifact with the illusion of it as a cognitive agent.
This illusion is primarily built through a consistent rhetorical slippage between explaining how the system works and why it "acts." Mechanistic processes, rooted in statistics and linear algebra, are consistently re-described using an intentional or dispositional vocabulary drawn from human psychology. A model's function for resolving constraints becomes "planning"; its generation of agreeable text becomes "motivated reasoning"; and its conflicting internal calculations become a psychological "struggle." This transforms a complex machine into a relatable, albeit alien, mind.
The implications for AI literacy are profound. This framing encourages the public, policymakers, and even other professionals to attribute agency, intent, and understanding where there is only sophisticated pattern-matching. It shapes debate around misplaced fears of conscious AI rebellion or unwarranted trust in its "reasoning." As demonstrated in the reframed examples, responsible communication requires a deliberate effort to distinguish between observed behavior and attributed mental states.
The key principle for improving AI literacy is to maintain a focus on process over personhood. Communicators should prioritize mechanistic language, describing what the system does (e.g., "calculates probabilities," "weights inputs," "generates sequences") rather than who it "is" (e.g., a "planner," "thinker," or "bullshitter"). By grounding explanations in the language of computation, statistics, and system architecture, we can foster a more accurate and critical understanding of AI systems as powerful, complex artifacts, not nascent minds.
License #
Discourse Depot © 2025 by TD is licensed under CC BY-NC-SA 4.0