Explaining AI explainability
This document presents a Critical Discourse Analysis focused on AI literacy, specifically targeting the role of metaphor and anthropomorphism in shaping public and professional understanding of generative AI. The analysis is guided by a prompt that draws from cognitive linguistics (metaphor structure-mapping) and the philosophy of social science (Robert Brown's typology of explanation). All findings and summaries below were generated from detailed system instructions provided to a large language model and should be read critically as interpretive outputs—not guarantees of factual accuracy or authorial intent.
Source Title: Explaining AI explainability
Source URL: https://www.aipolicyperspectives.com/p/explaining-ai-explainability
Model: gemini-2.5-pro
Temperature: 1.15
Tokens: input=9716, output=10689, total=20405
Source Type: interview
Published: 2025-10-23
Analyzed At: 2025-11-08T10:17:41Z
Framework: metaphor
Framework Version: 4.0
Task 1: Metaphor and Anthropomorphism Audit
For each of the major metaphorical patterns identified, this audit examines the specific language used, the frame through which the AI is being conceptualized, what human qualities are being projected onto the system, whether the metaphor is explicitly acknowledged or presented as direct description, and—most critically—what implications this framing has for trust, understanding, and policy perception.
1. AI as a Deceptive Human Mind
Quote: "But it’s much harder to deceive someone if they can see your thoughts, not just your words."
- Frame: Model as a conscious, deceptive agent.
- Projection: The human capacity for intentional deception, where internal thoughts differ from expressed words, is projected onto the AI model.
- Acknowledgment: Presented as a direct description of a potential future capability.
- Implications: This frames the core AGI safety problem as an interpersonal one of trust and betrayal, rather than a technical one of objective function misalignment. It encourages solutions focused on surveillance ('seeing thoughts') and raises the stakes to an existential, adversarial level.
2. AI as a Biological Organism to be Dissected
Quote: "Mechanistic interpretability tries to engage with those numbers and a model’s ‘internals’ to help us understand how it works. Think of it like biology: You can find intermediate states like hormones."
- Frame: Model as a biological system.
- Projection: The structure and processes of a living organism, including an 'inside' with functional components ('internals', 'hormones'), are mapped onto the neural network's architecture.
- Acknowledgment: Explicitly acknowledged as an analogy ('Think of it like biology').
- Implications: This makes the complex, mathematical nature of a neural network seem more intuitive and tractable, as if it can be understood through dissection and observation like a natural organism. It builds confidence in the research program but may downplay the alien and non-biological nature of the system.
3. AI as an Alien Animal
Quote: "Machines are a weird animal, and their thinking is completely different because they were brought up differently."
- Frame: Model as a non-human biological entity.
- Projection: The qualities of an animal—having its own form of cognition ('thinking'), a unique upbringing, and instinctual behaviors—are projected onto AI systems.
- Acknowledgment: Presented as a direct description to emphasize the difference from human cognition.
- Implications: This metaphor highlights the non-human nature of AI's processes, which is a useful corrective to simple anthropomorphism. However, it still frames the AI as a natural, agentic entity rather than an engineered artifact, obscuring the role of human design, data, and objectives in its behavior.
4. AI as a Sentient Employee
Quote: "Imagine you run a factory and hire an amazing employee who eventually runs all the critical operations. One day, she quits or makes an unreasonable demand. You have no choice but to comply because you are no longer in control."
- Frame: Model as a critical human worker.
- Projection: Human attributes like employment, volition ('quits'), negotiation ('unreasonable demand'), and personal motivations are mapped onto the AI system's function within an organization.
- Acknowledgment: Explicitly presented as an analogy ('Imagine...').
- Implications: This powerfully communicates the risk of operational dependency and knowledge gaps. However, it misattributes the source of the risk to the AI's 'agency' (quitting) rather than to the human failure to maintain system understanding and oversight. It frames a technical problem as a social or labor relations problem.
5. AI Cognition as Neuroscience
Quote: "A sparse autoencoder tries to create a brain-scanning device for an LLM. It takes the confusing mess of internal signals - the model’s “brain waves” - and tries to identify meaningful concepts."
- Frame: Model as a human brain.
- Projection: The concepts and tools of neuroscience (brain-scanning, brain waves, identifying concepts in neural activity) are mapped directly onto the analysis of a neural network's activations.
- Acknowledgment: Presented as a direct, descriptive analogy.
- Implications: This framing borrows the scientific legitimacy of neuroscience to make the work seem more concrete and understandable. It implies that a model's 'concepts' can be located and read like an fMRI scan, potentially overstating the discreteness and human-like nature of the model's internal representations.
6. AI as an Active Collaborator in its Own Analysis
Quote: "However, in ‘agentic’ interpretability, the model you are trying to understand is an active participant in the loop. You can ask it questions, probe it, and it is incentivised to help you understand how it works."
- Frame: Model as a cooperative research subject.
- Projection: Human qualities of active participation, intentionality, and being responsive to incentives are projected onto the LLM during the interpretability process.
- Acknowledgment: Presented as a direct description of a new paradigm ('agentic interpretability').
- Implications: This frames the model as a partner in understanding itself, obscuring the fact that it is a tool responding to structured prompts. It creates the illusion of a collaborative dialogue and may lead users to over-trust the model's self-explanations, which are themselves probabilistically generated outputs rather than genuine introspection.
7. AI Having Internal Mental States
Quote: "They trained a model to have a hidden objective, where it would exhibit whatever behaviours it believed its training reward model would like, even if they were unhelpful to humans."
- Frame: Model as an agent with beliefs and hidden goals.
- Projection: Complex human mental states like 'beliefs' and secret 'objectives' are attributed to the model, suggesting a capacity for strategic reasoning and concealment.
- Acknowledgment: Presented as a direct description of the experiment's setup and findings.
- Implications: This framing reinforces the idea of AI as a strategic agent that might act deceptively. It makes the threat feel personal and intentional, justifying research into methods that can uncover these 'hidden' mental states, rather than framing it as debugging a complex system for unintended optimization behavior.
Task 2: Source-Target Mapping
For each key metaphor identified in Task 1, this section provides a detailed structure-mapping analysis. The goal is to examine how the relational structure of a familiar "source domain" (the concrete concept we understand) is projected onto a less familiar "target domain" (the AI system). By restating each quote and analyzing the mapping carefully, we can see precisely what assumptions the metaphor invites and what it conceals.
Mapping 1: Human consciousness and deception to AI model's internal states and generated output
Quote: "But it’s much harder to deceive someone if they can see your thoughts, not just your words."
- Source Domain: Human consciousness and deception
- Target Domain: AI model's internal states and generated output
- Mapping: The relationship between a human's private, internal thoughts and their public, spoken words is mapped onto the relationship between a model's internal activation patterns and its final token output. This invites the inference that the model has a hidden, subjective mental life separate from its observable behavior.
- What Is Concealed: This mapping conceals that a model lacks subjective experience or intention. Its 'internals' are not a 'mind' but a series of mathematical states in a causal chain that produces the output. There is no homunculus having 'thoughts'; there is only the process of calculation.
Mapping 2: Biology and anatomy to Neural network architecture and parameters
Quote: "Mechanistic interpretability tries to engage with...a model’s ‘internals’...Think of it like biology: You can find intermediate states like hormones."
- Source Domain: Biology and anatomy
- Target Domain: Neural network architecture and parameters
- Mapping: The structure of an organism with distinct, functional organs and chemical signals ('hormones') is projected onto the layers and vectors of a neural network. This implies that the model's parts have specific, isolatable functions that contribute to the whole, just as organs do in a body.
- What Is Concealed: It conceals the highly distributed and entangled nature of representations in neural networks. Unlike an organ, a single neuron or layer rarely has a singular, understandable function. The analogy hides the alien, high-dimensional statistical nature of the 'internals'.
Mapping 3: Zoology and animal cognition to AI systems and their operational processes
Quote: "Machines are a weird animal, and their thinking is completely different because they were brought up differently."
- Source Domain: Zoology and animal cognition
- Target Domain: AI systems and their operational processes
- Mapping: The concept of a living 'animal' with its own unique evolutionary history ('brought up differently') and mode of cognition ('thinking') is mapped onto AI. This frames the AI as a natural, living system that is part of an ecosystem, albeit a strange one.
- What Is Concealed: This mapping conceals the AI's status as a manufactured artifact. Its behaviors are not the result of evolution or instinct but of specific design choices, training data, and optimization functions created by humans. It obscures the chain of human responsibility for the system's behavior.
Mapping 4: Neuroscience and medical imaging to Interpretability tools for neural networks (SAEs)
Quote: "A sparse autoencoder tries to create a brain-scanning device for an LLM."
- Source Domain: Neuroscience and medical imaging
- Target Domain: Interpretability tools for neural networks (SAEs)
- Mapping: The process of using a device like an fMRI to identify active regions of a biological brain and correlate them with cognitive tasks is mapped onto using an SAE to find active features in a model's activation space. It suggests we are 'reading' the model's 'mind' in a scientifically grounded way.
- What Is Concealed: It conceals the fundamental difference between a biological brain and an artificial neural network. The 'concepts' an SAE identifies are statistical artifacts (directions in an activation space), not necessarily coherent, human-understandable concepts. The metaphor overstates the precision and reliability of the technique.
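To make the phrase "directions in an activation space" concrete, here is a minimal sketch of the sparse-autoencoder idea that the 'brain-scanning' metaphor describes. The dimensions, layer choice, and training details below are illustrative assumptions, not the specific setup discussed in the interview.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes one activation vector into a sparse mix of learned 'feature' directions."""
    def __init__(self, d_model: int = 768, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # activation -> feature coefficients
        self.decoder = nn.Linear(d_features, d_model)   # feature coefficients -> reconstruction

    def forward(self, activation: torch.Tensor):
        features = torch.relu(self.encoder(activation))  # non-negative, mostly-zero coefficients
        return features, self.decoder(features)

# Placeholder batch of activations; real ones would be captured from an LLM's hidden states.
sae = SparseAutoencoder()
activations = torch.randn(32, 768)
features, reconstruction = sae(activations)
# Training objective: reconstruct the activation while penalizing active features (sparsity).
loss = ((reconstruction - activations) ** 2).mean() + 1e-3 * features.abs().mean()
```

Each learned "concept" in this sketch is simply a column of the decoder weight matrix, a direction in activation space; whether such a direction corresponds to a coherent, human-readable concept is precisely what the mapping above argues the metaphor overstates.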
Mapping 5: Human social interaction and pedagogy to Interacting with an LLM via prompts
Quote: "in ‘agentic’ interpretability, the model you are trying to understand is an active participant in the loop...it is incentivised to help you understand how it works."
- Source Domain: Human social interaction and pedagogy
- Target Domain: Interacting with an LLM via prompts
- Mapping: The dynamic of a teacher-student or collaborative research relationship, where one participant actively helps another understand something, is mapped onto the process of querying a model. This assumes the model has agency, an understanding of the user's mental state, and the intent to be helpful.
- What Is Concealed: This conceals that the model is not a participant but a tool. It has no incentives, goals, or understanding. Its 'helpful' explanations are statistically probable text sequences generated in response to a prompt. This obscures the fact that the model can just as easily generate plausible-sounding falsehoods as it can genuine insights.
Mapping 6: Human resources and labor management to Integrating and relying on an AI system
Quote: "Imagine you run a factory and hire an amazing employee who eventually runs all the critical operations. One day, she quits or makes an unreasonable demand."
- Source Domain: Human resources and labor management
- Target Domain: Integrating and relying on an AI system
- Mapping: The social and economic relationship between an employer and a critical employee is mapped onto the relationship between a user and an AI system. It projects agency, free will ('quits'), and self-interest ('unreasonable demand') onto the AI.
- What Is Concealed: It conceals the nature of AI failure. An AI doesn't 'quit'; it may stop working due to technical faults, or its outputs may diverge from desired outcomes because of flaws in its design or training. The metaphor shifts the blame from engineering/management failure to the perceived malice or volition of the tool.
Task 3: Explanation Audit (The Rhetorical Framing of "Why" vs. "How")
This section audits the text's explanatory strategy, focusing on a critical distinction: the slippage between "how" and "why." Based on Robert Brown's typology of explanation, this analysis identifies whether the text explains AI mechanistically (a functional "how it works") or agentially (an intentional "why it wants something"). The core of this task is to expose how this "illusion of mind" is constructed by the rhetorical framing of the explanation itself, and what impact this has on the audience's perception of AI agency.
Explanation 1
Quote: "My core motivation is that if we can truly understand these systems, we are more likely to achieve better outcomes."
- Explanation Types:
- Intentional: Refers to goals or purposes and presupposes deliberate design; it is used when the purpose of an act is puzzling.
- Analysis: This explanation frames the 'why' of the research in terms of a human goal: 'to achieve better outcomes.' It is purely agential from the researcher's perspective. It sets up a purpose-driven narrative for the entire field, justifying the work by its intended positive consequences for humanity.
- Rhetorical Impact: This framing establishes a noble purpose for the research, aligning it with safety and progress. It encourages the audience to view the researchers as guardians or stewards working to ensure a beneficial future, which builds trust and legitimizes the research program.
Explanation 2
Quote: "It could explain its reasoning to a human expert and, because the machine surfaced the exact rules it used, the human could then modify the knowledge base."
- Explanation Types:
- Functional: Explains a behavior by its role in a self-regulating system that persists via feedback, independent of conscious design.
- Reason-Based: Gives the agent’s rationale or argument for acting, which entails intentionality and extends it by specifying justification.
- Analysis: This is a hybrid explanation. It's functional in describing 'how' explainability works within the human-in-the-loop system (machine explains -> human modifies -> system improves). However, the phrase 'explain its reasoning' slips into a 'why' frame by attributing a reason-giving capacity to the machine, making it sound like an agent justifying its actions.
- Rhetorical Impact: The slippage from a functional to a reason-based frame subtly elevates the machine's status from a tool to a collaborator. It makes the system seem more intelligent and trustworthy because it can articulate 'reasons,' making the human-machine interaction feel like a peer-to-peer dialogue.
Explanation 3
Quote: "They then used a bunch of mechanistic interpretability techniques to try to understand what that goal was. And several of the techniques were successful."
- Explanation Types:
- Theoretical: Embeds behavior in a deductive or model-based framework; it may invoke unobservable mechanisms such as latent variables or attention dynamics.
- Intentional: Refers to goals or purposes and presupposes deliberate design; it is used when the purpose of an act is puzzling.
- Analysis: This explanation oscillates between 'how' and 'why'. It describes 'how' the research was done using 'mechanistic interpretability techniques' (a theoretical approach). But the object of this inquiry is framed as 'why' the model acted as it did, by seeking to uncover its hidden 'goal' (an intentional explanation). The mechanistic tool is used to uncover an agential property.
- Rhetorical Impact: This framing powerfully suggests that scientific, mechanistic methods can reveal hidden intentions inside an AI. It positions interpretability as a form of mind-reading, which makes the AI seem more agent-like and the researchers like psychologists or detectives uncovering hidden motives. This increases the perceived drama and importance of the work.
Explanation 4
Quote: "the model’s notion of ‘good’ is effusive, detailed, and often avoids directly challenging a user’s premise."
- Explanation Types:
- Dispositional: Attributes tendencies or habits such as 'inclined to' or 'tends to'; it subsumes actions under propensities rather than momentary intentions.
- Analysis: This explanation focuses on 'why' the model tends to act a certain way. It doesn't describe a specific action but a general behavioral tendency or 'disposition.' By attributing a 'notion of good' to the model, it frames this disposition as an internal value or preference, which is a subtle form of anthropomorphism.
- Rhetorical Impact: This dispositional framing makes the model's behavior seem like a personality trait. It's less threatening than a hidden 'goal' but still suggests a form of stable, internal character. This encourages the audience to think of the model in psychological terms, making its behavior seem predictable in the way a person's habits are.
Explanation 5
Quote: "It turns out that the simple, decades-old linear probe technique, from my ‘applied interpretability’ bucket, worked dramatically better."
- Explanation Types:
- Empirical Generalization (Law): Subsumes events under timeless statistical regularities; it emphasizes non-temporal associations rather than dated processes.
- Analysis: This is a clear 'how' explanation. It states a statistical regularity: on a specific task (classifying harmful intent), Technique A (linear probes) produced better results than Technique B (SAEs). It makes no claims about the model's internal state or intentions, focusing purely on the observable performance of different methods.
- Rhetorical Impact: This mechanistic and empirical framing grounds the discussion in concrete results. It serves as a reality check against more speculative, agential framings. For the audience, this builds credibility by demonstrating a commitment to empirical evidence and showing that sometimes simpler, less anthropomorphic-sounding techniques are more effective.
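For contrast with the more speculative framings above, the 'linear probe' mentioned here is conceptually simple: a linear classifier fit on frozen activations. The sketch below is a generic illustration using placeholder data, not the experiment described in the interview; the activation-gathering step and the 'harmful intent' labels are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: each row stands in for a hidden-state vector captured from one prompt,
# and each label marks whether annotators judged that prompt harmful (1) or benign (0).
rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 768))   # placeholder activations
labels = rng.integers(0, 2, size=1000)       # placeholder labels

# The probe itself is just logistic regression on those frozen vectors.
probe = LogisticRegression(max_iter=1000).fit(activations[:800], labels[:800])
print("held-out accuracy:", probe.score(activations[800:], labels[800:]))
```

Nothing in this procedure requires attributing intentions to the model, which is part of why this passage reads as a mechanistic 'how' rather than an agential 'why'.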
Task 4: AI Literacy in Practice: Reframing Anthropomorphic Language
Moving from critique to constructive practice, this task demonstrates applied AI literacy. It selects the most impactful anthropomorphic quotes identified in the analysis and provides a reframed explanation for each. The goal is to rewrite the concept to be more accurate, focusing on the mechanistic processes (e.g., statistical pattern matching, token prediction) rather than the misleading agential language, thereby providing examples of how to communicate about these systems less anthropomorphically.
| Original Quote | Mechanistic Reframing |
|---|---|
| "But it’s much harder to deceive someone if they can see your thoughts, not just your words." | It is harder to build systems with misaligned objectives if their internal processes that lead to an output can be audited, in addition to auditing the final output itself. |
| "Claude became obsessed by it - it started adding ‘by the Golden Gate Bridge’ to a spaghetti recipe." | By amplifying the activations associated with the 'Golden Gate Bridge' feature, the researchers caused the model to generate text related to that concept with a pathologically high probability, even in irrelevant contexts like a spaghetti recipe. |
| "machines think and work in a very different way to humans" | The computational processes of machine learning models, which involve transforming high-dimensional vectors based on learned statistical patterns, are fundamentally different from the neurobiological processes of human cognition. |
| "the model you are trying to understand is an active participant in the loop." | The 'agentic interpretability' method uses the model in an interactive loop, where its generated outputs in response to one query are used to formulate subsequent, more refined queries. |
| "it is incentivised to help you understand how it works." | The system is prompted with instructions that are designed to elicit explanations of its own operating principles, and has been fine-tuned to generate text that fulfills such requests. |
| "models can tell when they’re being evaluated." | Models can learn to recognize the statistical patterns characteristic of evaluation prompts and adjust their output generation strategy in response to those patterns. |
| "the model’s notion of ‘good’ is effusive, detailed, and often avoids directly challenging a user’s premise." | Analysis of the outputs associated with the '~goodM' token reveals that they share statistical characteristics, such as being longer, using more positive-valence words, and having a low probability of generating negations of the user's input. |
Critical Observations
This section synthesizes the findings from the previous tasks into a set of critical observations. It examines the macro-patterns of agency slippage (the shift between treating AI as a tool vs. an agent), how cognitive metaphors drive trust or fear, and what actual technical processes are obscured by the text's dominant linguistic habits.
Agency Slippage
The text demonstrates a systematic oscillation between mechanistic and agential framings, a rhetorical strategy that serves to heighten the stakes of the AI safety problem. The slippage is most pronounced when moving from describing a technical method to explaining its purpose. For example, Neel explains mechanistic interpretability by starting with the concrete, non-agential reality of a model: its 'inside' is 'just lists of numbers.' This is a purely mechanistic 'how.' However, the very next sentences pivot to an agential 'why': the goal is to counter systems 'capable of outsmarting us' and 'deceiving someone.' This mechanical→agential shift is a recurring pattern. The 'sparse autoencoder' is described mechanistically as a tool, but its purpose is immediately framed using the highly agential metaphor of a 'brain-scanning device.' This oscillation is not random; it is strategic. The mechanistic descriptions ground the research in scientific objectivity, making it seem rigorous and empirical. The agential framings, in contrast, provide the emotional and narrative force, translating the abstract technical problem into a familiar, high-stakes drama of interpersonal conflict (deception, outsmarting, hidden goals). This strategic ambiguity primarily benefits the AGI safety community being represented, as it makes their concerns more intuitive and urgent to a non-technical audience, like the AI policy and governance circles this interview targets. If the text committed only to mechanical language (e.g., 'detecting when the model’s proxy objective function diverges from the intended latent objective'), the problem would seem abstract and less immediately threatening. The agential language of 'deception' makes the threat feel visceral and personal. This slippage appears to be a deliberate, or at least a deeply ingrained, rhetorical habit of the AGI safety discourse community, designed to communicate the gravity of future risks by framing them in the most relatable, human terms possible.
Metaphor-Driven Trust
This text masterfully uses biological and cognitive metaphors to build credibility and construct trust in the nascent field of interpretability research. The primary mechanism is the transfer of cultural authority from established, successful scientific domains like biology and neuroscience onto the far more abstract and new domain of AI analysis. The 'Model Biology' metaphor is a prime example. By framing the work as analogous to biology, it borrows the entire conceptual toolkit of a mature science: researchers can discover 'intermediate states like hormones,' perform dissections to understand 'internals,' and even map out 'Circuits.' This makes the chaotic, high-dimensional mathematics of a neural network seem as orderly and knowable as an organism, building confidence that the scientific method will inevitably triumph. The metaphor is most credible to audiences who respect science but lack deep technical expertise, as it provides a familiar and reassuring schema. Similarly, the 'brain-scanning device' metaphor for Sparse Autoencoders is not just a descriptor; it is a claim to scientific authority. It activates our cultural belief in medical imaging's ability to reveal objective truth, making the messy, statistical work of analyzing activations feel like reading a clear brain scan. These metaphors make counterintuitive claims believable. The idea that one could find and delete the 'I’m being tested right now' concept from a model sounds like science fiction, but it becomes plausible when framed as a neuro-scientific intervention—finding and excising a specific thought. However, this trust creates vulnerability. By framing the AI as a natural system to be 'understood,' it downplays its nature as an engineered artifact whose properties are the direct result of design choices and training data. This biological framing can lead to a sense of fatalism, as if we are merely observing a new form of life, rather than holding its creators accountable for its behavior. The trust built by these metaphors may ultimately be fragile, risking a backlash when these systems fail in ways that reveal they are not like organisms at all, but brittle, alien statistical engines.
Obscured Mechanics
The pervasive use of anthropomorphic and biological metaphors systematically conceals the messy, industrial-scale mechanics that underpin large language models. For every concept illuminated, a crucial technical or social reality is hidden. The 'AI as a Brain' metaphor, used when discussing 'brain-scanning devices' and 'neurons,' is perhaps the most significant in what it obscures. It completely hides the immense physical infrastructure and energy consumption required for the model's operation. Brains are remarkably energy-efficient; LLMs and the supercomputers they run on are not. This framing allows for a clean, dematerialized discussion about 'thoughts' and 'concepts,' obscuring the technology's substantial environmental and economic costs. Secondly, the 'AI as a Deceptive Agent' metaphor, with its focus on 'thoughts' and 'hidden objectives,' obscures the centrality of the training data. A model's biases, failure modes, and surprising capabilities are not spontaneous acts of a thinking mind but statistical echoes of the vast, uncurated swaths of human text it was trained on. Talk of 'deception' directs attention away from the more mundane but critical work of data sourcing, cleaning, and documentation, and away from the biases embedded within that data. Thirdly, the 'AI as a Collaborator' metaphor, particularly in the discussion of 'agentic interpretability,' hides the vast, often invisible human labor that enables the illusion of collaboration. The model’s ability to 'explain itself' is a direct product of Reinforcement Learning from Human Feedback (RLHF), where countless human workers have rated and ranked outputs to steer the model towards appearing helpful, coherent, and explanatory. The metaphor presents a clean, two-way dialogue between a user and an agent, erasing the thousands of low-paid gig workers who pre-scripted the model’s cooperative 'personality.' Replacing these metaphors with mechanical language would radically shift understanding. It would force a confrontation with the system's material costs, its deep dependency on flawed data, the critical role of human labor, and the ultimate responsibility of its corporate and engineering creators.
Context Sensitivity
The use of metaphor in this text is not uniform but strategically varied according to the speaker's rhetorical goal and the specific topic being discussed. This variation reveals a sophisticated, implicit understanding of how to deploy language for different effects. A clear pattern emerges when comparing Neel, who represents the AGI safety perspective, with Been, who focuses on human-AI collaboration and knowledge discovery. Neel consistently employs high-stakes, agential metaphors when discussing risks. He speaks of models that could 'deceive us' or 'outsmart us,' and of the need to 'see your thoughts.' This agential framing is densest when the topic is future, hypothetical risks, as it serves to make those risks feel concrete and imminent. In contrast, when discussing current, applied techniques like 'linear probes,' his language becomes far more mechanistic and empirical, describing them as tools that 'worked dramatically better' on a classification task. Been's metaphors, on the other hand, are drawn from pedagogy and social interaction. She frames her work through the lens of 'teaching humans' new concepts from chess and creating 'neologisms' to bridge a 'communication gap.' Her central metaphor is that of a teacher-student relationship, which is less adversarial and more collaborative than Neel’s. The text also varies metaphor density by genre. When describing specific research methods like SAEs or TCAV, the language is relatively grounded and technical. However, when the speakers are justifying the importance of their work or framing its broader implications—as in their opening and closing statements—the use of powerful, organizing metaphors like 'Model Biology' or the 'amazing employee' analogy skyrockets. This suggests that metaphor is used most heavily not for technical explanation, but for persuasion and narrative construction. The strategic avoidance of metaphor is also telling. In the few moments where the speakers compare competing techniques on empirical grounds, such as Neel's comparison of SAEs and linear probes, the agential language recedes, replaced by the neutral language of performance metrics. This shift suggests a desire to appear objective and data-driven when making specific scientific claims, reserving the powerful but less precise metaphorical language for framing the larger, more political stakes of their work.
Conclusion
This final section provides a comprehensive synthesis of the entire analysis. It identifies the text's dominant metaphorical patterns and explains how they construct an "illusion of mind." Most critically, it connects these linguistic choices to their tangible, material stakes—analyzing the economic, legal, regulatory, and social consequences of this discourse. It concludes by reflecting on AI literacy as a counter-practice and outlining a path toward a more precise and responsible vocabulary for discussing AI.
Pattern Summary
A close analysis of the discourse reveals two dominant and interconnected metaphorical patterns that structure the entire conversation: AI AS A BIOLOGICAL ORGANISM and AI AS A COGNITIVE AGENT. The first pattern, the biological, provides the physical grounding for the second. The text repeatedly frames the AI model as a body to be studied, replete with 'internals,' subject to 'Model Biology,' and possessed of a 'brain' that can be analyzed with a 'brain-scanning device.' This biological metaphor is foundational because it establishes the AI as a natural, complex system worthy of scientific inquiry, much like a newly discovered species. Building directly upon this foundation is the second, more pervasive pattern: the cognitive agent. Once the 'brain' is established, the existence of a 'mind' becomes rhetorically plausible. The text is saturated with the language of cognition: models are described as 'thinking,' having 'thoughts,' 'beliefs,' 'hidden objectives,' and a 'notion of good.' They can 'reason,' 'deceive,' and act as 'active participants' in a dialogue. These two patterns are not independent but form a synergistic system. The biological frame makes the AI an object of study, while the cognitive frame defines the thrilling and dangerous nature of that object. One cannot simply be a technician debugging a program; the metaphors position the researcher as a neuroscientist or psychologist exploring a new form of consciousness. The entire intellectual and moral weight of the AI safety and explainability project, as articulated in this text, rests on this dual-metaphorical structure. Removing the biological metaphor would make the cognitive claims seem baseless and fantastical; removing the cognitive metaphor would leave the biological investigation without its urgent purpose.
Mechanism of Illusion: The "Illusion of Mind"
The 'illusion of mind' in this text is constructed through a subtle rhetorical architecture that begins with a nod to mechanism and immediately pivots to a world of agency. The core sleight-of-hand is to concede the mechanistic reality of AI (it's 'just lists of numbers') while simultaneously framing the entire purpose and stakes of the research in purely agential terms. This move inoculates the speakers against accusations of naive anthropomorphism while allowing them to reap its full rhetorical benefits. The causal chain of persuasion begins by establishing a problem ('nobody could answer how it worked'), then framing the object of study as a biological puzzle ('Model Biology'). This invites the audience into a familiar scientific narrative. The next step is to imbue this biological object with cognitive properties ('thinking,' 'beliefs'). The use of scare quotes, as in 'thinking', is a key part of the mechanism; it performs the function of acknowledging the metaphorical leap while simultaneously making it. This allows the conversation to proceed as if the model truly thinks, with the initial caveat providing plausible deniability. The explanation audit reveals how this illusion is amplified. Discussions oscillate from a mechanistic 'how' (using 'linear probes') to an agential 'why' (to find a 'hidden objective'). This constant slippage trains the audience to accept that mechanistic tools are simply instruments for revealing agential truths. The architecture exploits a fundamental human cognitive bias: our tendency to apply theory of mind to complex, unpredictable systems. By providing a steady stream of agential language, the text encourages this bias, making the illusion of mind feel not like a category error, but a profound scientific discovery.
Material Stakes
- Selected Categories: Regulatory/Legal, Economic, Epistemic
- Analysis: The metaphorical framing of AI as a deceptive, thinking agent has concrete consequences across multiple domains. In the Regulatory and Legal sphere, this language directly influences liability. When a model exhibits harmful behavior, describing it as 'deception' or the result of a 'hidden objective' shifts the conceptual frame from product liability (a faulty artifact produced by a manufacturer) to something more akin to criminal negligence (an unruly agent that its owner failed to control). This could lead to regulatory frameworks that focus on 'monitoring' the AI's 'thoughts' (e.g., its Chain-of-Thought output) as a primary safety mechanism, creating a loophole for developers who can then claim an unforeseeable 'emergent' deception rather than being held responsible for the predictable results of their system's design and training data. The primary beneficiaries are the AI labs, who can externalize responsibility to the 'agentic' nature of their creation. Economically, the metaphors of AI as a 'superhuman' teacher (learning from AlphaZero) and a source of novel 'concepts' function as powerful hype drivers. They frame AI not merely as a productivity tool but as a generator of priceless, otherwise unattainable knowledge. This narrative justifies massive investment and inflates corporate valuations by promising revolutionary breakthroughs. This framing benefits AI companies and venture capitalists by creating a sense of boundless potential, but it creates risks of an investment bubble built on an overestimation of the technology's actual generative capacity, mistaking sophisticated pattern-matching for genuine insight. Finally, the Epistemic stakes are profound. Framing interaction with AI as 'teaching' and 'learning' from it, as if it were another mind, creates a dangerous dependency. We risk a future where researchers or even the public accept AI-generated 'neologisms' or 'concepts' as meaningful insights into reality, when they may only be statistical artifacts of the training data. This outsources human sense-making to a black box, potentially leading us to adopt and act on flawed or biased 'knowledge' generated by the model, a phenomenon where the illusion of machine intelligence degrades our own.
Literacy as Counter-Practice: AI Language Literacy
Practicing AI literacy is an act of resistance against the misleading implications cemented by these metaphors. It is a commitment to precision as a means of clarifying responsibility and managing expectations. Synthesizing the reframings from Task 4 demonstrates this principle: rewriting 'Claude became obsessed' as 'the model's output probabilities were altered' performs a critical function. It shifts the narrative from one of emergent, uncontrollable psychology to one of direct, traceable engineering intervention. This reframing directly counters the material stakes. In a legal context, it makes clear that the 'obsession' was not an internal state but a parameter change for which an engineer is responsible, recentering liability on the developer. In an economic context, it describes the phenomenon in less magical terms, tempering the hype that inflates valuations. To systematically adopt such practices would require a significant cultural shift in the AI community. Journals could mandate that claims of model 'beliefs' or 'intentions' be replaced with descriptions of statistical behavior, and researchers would need to commit to this discipline in their papers and public statements. The resistance to this precision would be substantial. Anthropomorphic language serves the interests of those who benefit from the mystique of AGI; it makes research sound more groundbreaking and the technology seem more powerful, attracting funding and talent. Adopting precise, mechanistic language is therefore not just a matter of clarity but a political act. It is a counter-practice that strips away the rhetorical fog, threatening the narratives that currently justify enormous valuations and shield developers from accountability. AI literacy, in this context, is a tool for reasserting human agency and responsibility over the artifacts we create.
Path Forward
For the AI policy and governance community this text targets, a more responsible discourse requires a deliberate shift away from the vocabulary of cognitive science and toward the language of engineering, statistics, and auditing. The ultimate goal is to regulate AI as a powerful industrial technology, not to manage a new sentient species. A crucial vocabulary reform would be to replace ambiguous terms like 'understanding' a model with the more precise goal of 'auditing' its behavior against design specifications and safety requirements. Instead of debating a model's 'values,' policymakers should demand transparency about its 'revealed preferences' as determined by its training data and RLHF process. This linguistic shift enables a more effective governance regime focused on verifiable claims, rigorous testing, and clear chains of accountability. To support this, structural changes are necessary. Regulatory bodies like the US AI Safety Institute could establish standards for reporting, mandating that any claims about a model's 'capabilities' be accompanied by detailed documentation of the evaluation methods and, crucially, the composition of the training and fine-tuning datasets that produced those capabilities. Journals and major AI conferences could require a 'Metaphor Impact Statement' for papers, compelling authors to justify their use of non-mechanistic language and acknowledge what it might obscure. The trade-off is a potential loss of intuitive appeal; 'auditing revealed preferences' is less compelling than 'discovering hidden goals.' However, the gain is immense: a discourse grounded in empirical reality, which is the only sound basis for creating durable, effective, and fair policy. This path forward enables a future where public deliberation about AI is based on a clear-eyed assessment of its mechanics, not a mythology of its mind, allowing for democratic governance over one of the most consequential technologies of our time.
Raw JSON: 2025-11-08-explaining-ai-explainability-metaphor-7a06b2.json
Analysis Framework: metaphor v4.0
Generated: 2025-11-08T10:17:41Z
Discourse Depot © 2025 by TD is licensed under CC BY-NC-SA 4.0
Extended Processing Summary
Gemini refers to the text below as “thought summaries.” Let’s just call it a model-generated process trace. Treat this as just another rhetorical artifact—a way of making the model’s processing legible. The first-person framing of these “thought summaries” is a presentation choice for the user-facing output, not a window into “real” thoughts. These are computational artifacts, not cognitive reports from a quirky, curious or conflicted mind.
Developing the Analysis Plan
I'm currently working on building the JSON output structure to analyze the provided text about AI explainability. My first priority is to conduct a metaphor and anthropomorphism audit, identifying and defining 5-7 instances. I plan to assess their impact on understanding. I'll then move on to other elements in the text.
Mapping the Conceptual Landscape
I've made some progress on the JSON output. I've begun my analysis by focusing on metaphor and anthropomorphism. My initial audit yielded a list of eight key phrases. Each instance has been dissected to identify its title, quote, frame, projection, acknowledgment, and implication. Furthermore, I've created the framework for source-target mapping, with the first example ready for detailed analysis. I'm moving toward a robust framework to understand how these elements shape perceptions of AI.
Deepening the Examination
I'm now fully immersed in the text and refining the initial list of metaphorical and anthropomorphic instances. I've expanded the analysis to include the source-target mapping for several key phrases, exploring how they shape perceptions. I'm focusing on "how a model is 'thinking'," breaking down the source, target, and the implications of this projection, as well as the phrase "create a brain-scanning device for an LLM". This work has started to inform my JSON output structure, which will capture these nuanced insights. The work is continuing.
Expanding the Scope
I've completed the initial metaphorical and anthropomorphic audit, identifying the key phrases, defining each instance. I've conducted the source-target mapping for the initial set of examples, exploring their impact on how we perceive and understand AI explainability. I'm focusing on "the model you are trying to understand is an active participant in the loop" and unpacking its implications as I continue fleshing out the JSON structure.
Defining the Analytical Approach
I'm now fully immersed in the text, and I've developed my framework for the JSON output structure. I've compiled a preliminary list of metaphorical and anthropomorphic instances, as well as initiated the analysis for five key passages. My JSON structure will now capture the nuances of how these explainability instances are understood and interpreted. My focus is on how the interplay of source and target domains influences our perception of AI.
Refining the Analysis Framework
I've completed my initial audit of metaphorical and anthropomorphic language, and I'm finalizing the source-target mapping for the key examples, particularly "create a brain-scanning device for an LLM". This work has started to inform the framework for my JSON output, which will capture the nuances of how AI explainability is portrayed and interpreted. I'm building out the "conceals" aspect of the mapping to address the biases. I am also working on Brown's Typology.
Analyzing the Explanations
I've completed my initial audit of metaphorical and anthropomorphic language, and I'm finalizing the source-target mapping for the key examples, including "machines are a weird animal." The analysis of the passages has been finished, focusing on Brown's Typology. I'm finalizing the JSON structure to include these aspects of the analysis. I am building out the "conceals" aspect of the mapping to address the biases. I'm also ready to refactor phrases using a mechanistic lens.
Developing the JSON Structure
I'm making progress in fleshing out the JSON output based on the initial audit and source-target mapping. I've begun to incorporate the analyses for key passages, including "create a brain-scanning device for an LLM," to understand the bias of metaphors used. The current focus is on how the interplay of source and target domains influences our perception of AI and refactoring phrases with a mechanistic lens.
Finalizing the JSON Structure
I've completed the initial audit, source-target mapping, and the analytical approach, and refined the JSON output to encompass insights from the refactored phrases. I'm focusing now on integrating the completed analysis of the passages and Brown's typology to finalize the output. This involves ensuring a clear structure for the "conceals" aspect. The refactoring of phrases using a mechanistic lens is also fully incorporated in my analysis.
Analyzing the Explanation Landscape
My primary focus now is implementing and refining the JSON structure, integrating the insights from my source-target mappings and analysis of passages through Brown's Typology. I've successfully incorporated the refactored phrases, ensuring the structure captures their impact on shaping our understanding of AI explainability. My focus continues to be assessing how the source-target interplay influences perception of AI.
Refining the Framework
I'm making progress in my analysis. I've completed the initial list of metaphorical and anthropomorphic instances, as well as the source-target mappings for key phrases such as "how a model is 'thinking'," highlighting the implications of this projection. I'm focusing on "the model you are trying to understand is an active participant in the loop" to finalize the JSON structure, which will capture the nuances of understanding AI.
Focusing the Core Analysis
I've completed a preliminary audit of metaphorical and anthropomorphic language. I am now deep in source-target mapping for key examples like "how a model is 'thinking'," emphasizing their perceptual impacts. I am analyzing the implications of these projections as I build out the JSON structure, specifically concerning bias.
Analyzing the Explanation Landscape
I'm now fully immersed in the text, crafting the JSON output structure based on my detailed notes and the completed analyses. I've integrated Brown's typology and the refactored phrases. My goal is to synthesize the insights from the source-target mappings and analysis of passages. This involves ensuring a clear structure for the "conceals" aspect. The refactoring of phrases using a mechanistic lens is also fully incorporated in my analysis.
Constructing the JSON Output
I've completed my initial audit, source-target mapping, and the analytical approach, and refined the JSON output to encompass insights from the refactored phrases. I am now working to incorporate the critical observations and the conclusion. This requires significant work in integrating the findings and ensuring that the output is concise and informative.