When Models Know More Than They Say: Probing Analogical Reasoning in LLMs
This document presents a Critical Discourse Analysis focused on AI literacy, specifically targeting the role of metaphor and anthropomorphism in shaping public and professional understanding of generative AI. The analysis is guided by a prompt that draws from cognitive linguistics (metaphor structure-mapping), the philosophy of social science (Robert Brown's typology of explanation), and accountability analysis.
All findings and summaries below were generated from detailed system instructions provided to a large language model and should be read critically as interpretive outputs—not guarantees of factual accuracy or authorial intent.
- Source Title: When Models Know More Than They Say: Probing Analogical Reasoning in LLMs
- Source URL: https://arxiv.org/abs/2604.03877v1
- Model: gemini-3.1-pro-preview
- Temperature: 1.05
- Top P: 0.95
- Tokens: input=12881, output=17433, total=30314
- Source Type: article
- Published: 2026-04-04
- Analyzed At: 2026-05-03T11:26:13.649Z
- Framework: metaphor
- Framework Version: 6.5
- Schema Version: 3.0
- Run ID: 2026-05-03-when-models-know-more-than-they-say-prob-metaphor-aix46b
Metaphor & Illusion Dashboard
Anthropomorphism audit · Explanation framing · Accountability architecture
Explanation Audit
"Pretrained models can encode latent information about entities and relations without explicit supervision (Li et al., 2021), and prompting strategies like chain-of-thought (CoT) have been used as evidence that LLMs can perform reasoning-like operations."
- How/Why Slippage: 20% of explanations use agential framing (2 / 10 explanations)
- Unacknowledged Metaphors: 75% presented as literal description (no meta-commentary or hedging)
- Hidden Actors: 75% with agency obscured by agentless constructions (corporations/engineers unnamed)
[Charts: Explanation Types (How vs. Why framing); Acknowledgment Status (meta-awareness of metaphor); Actor Visibility (accountability architecture)]
Reframed Language Samples
| Original Quote | Mechanistic Reframing | Technical Reality | Human Agency Restoration |
|---|---|---|---|
| When Models Know More Than They Say | When the internal mathematical weights of a model contain linearly separable statistical patterns that its autoregressive generation pipeline fails to output as text. | Models do not possess justified belief (knowing) or intentional communication (saying). Mechanistically, researchers can train external classifiers to find high-dimensional spatial correlations in the model's hidden layers that the model's own next-token prediction function does not heavily weight during output generation. | N/A - describes computational processes without displacing responsibility. |
| they struggle in cases where an analogy is not apparent on the surface | The models fail to output statistically correlated token sequences when the testing benchmark lacks the structural text adjacencies present in their training data. | Algorithms do not experience subjective exertion or 'struggle'. Mechanistically, when a prompt lacks surface-level textual overlap with the model's training distribution, the model assigns low probability to the token sequences that make up the human-expected analogical output. | N/A - describes computational processes without displacing responsibility. |
| assessing whether LLMs acquire the competencies that support narrative understanding | Assessing whether human engineers have successfully designed training objectives that force LLMs to mathematically encode structural features of human narratives. | LLMs do not experience conscious awareness or 'understanding'. Mechanistically, the model classifies and processes token embeddings, continually adjusting internal weights during training to minimize prediction error across a vast corpus of narrative text. | Engineers at companies like Meta and OpenAI actively select the datasets and design the reinforcement learning pipelines that determine which statistical features these models encode. |
| do LLMs internalize typological structures... or are they simply leveraging surface-level correlations | Do transformer architectures encode highly distributed, multi-layer geometric representations of text structures, or do their outputs rely predominantly on localized N-gram and syntactical probabilities? | A matrix of parameters cannot 'internalize' knowledge into a cognitive framework. Mechanistically, the system dynamically calculates token probabilities. The question is whether its attention heads operate on deep, abstracted feature spaces across many layers or heavily weight immediate, adjacent token pairs. | N/A - describes computational processes without displacing responsibility. |
Task 1: Metaphor and Anthropomorphism Audit
About this task
For each of the major metaphorical patterns identified, this audit examines the specific language used, the frame through which the AI is being conceptualized, what human qualities are being projected onto the system, whether the metaphor is explicitly acknowledged or presented as direct description, and—most critically—what implications this framing has for trust, understanding, and policy perception.
V3 Enhancement: Each metaphor now includes an accountability analysis.
1. Cognition as Biological Process
Quote: "assessing whether LLMs acquire the competencies that support narrative understanding"
- Frame: Model as developing biological organism
- Projection: The metaphor maps the human developmental process of cognitive acquisition and subjective comprehension onto a statistical system. By using the verbs 'acquire' and 'understand', the text projects conscious awareness and developmental learning onto a static computational artifact. The system does not 'acquire' competencies through lived experience; its weights are updated via gradient descent during a fixed training phase. It does not 'understand' narratives; it processes token sequences to optimize for statistical probability based on its training distribution. This projection invites the reader to imagine the LLM as a student or child growing in its grasp of the world, rather than a matrix of parameters being adjusted to minimize a loss function. It substitutes mechanical processing and classification for the rich, subjective state of conscious knowing.
- Acknowledgment: Direct (Unacknowledged) (The claim is presented as a straightforward research objective without scare quotes or hedging. I considered 'Hedged/Qualified' because a later sentence uses 'reasoning-like', but this specific foundational sentence treats 'acquire' and 'understanding' as literal, unvarnished capabilities of the system.)
- Implications: Framing LLMs as acquiring human-like understanding inflates perceived capability and shifts the burden of trust. If a model 'understands', users are far more likely to extend relation-based trust to it, assuming it grasps nuance, context, and ethical boundaries in a human way. This creates severe liability ambiguity: when the model generates toxic or incorrect analogies, the failure is interpreted as a temporary lapse in 'understanding' rather than a fundamental absence of semantic grounding, leading policymakers to under-regulate the deployment of such statistical pattern-matchers in critical domains.
Accountability Analysis:
- Actor Visibility: Hidden (agency obscured)
- Analysis: The construction 'LLMs acquire' places the technological artifact as the sole grammatical and causal agent of the action. This entirely obscures the massive engineering teams, data scrapers, and corporate executives at organizations like OpenAI and Meta who actively designed the training objectives, selected the massive datasets, and deployed the systems. I considered 'Partial (some attribution)' because the authors refer to 'assessing' (implying researchers), but the actual acquisition of capabilities is attributed entirely to the LLM. If human actors were named, we would ask what specific data Meta included to force these statistical correlations.
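The projection above contrasts "acquiring competencies" with weights being updated via gradient descent to minimize a loss function. As a rough illustration of that mechanism, the sketch below fits a single weight to toy data. The one-parameter model, the data, and the names `train_weight`, `lr`, and `steps` are all assumptions made for illustration; none of them come from the source paper, and real training involves billions of parameters and a token-prediction loss.

```python
def train_weight(xs, ys, lr=0.05, steps=200):
    """Fit y ≈ w * x by gradient descent on mean squared error (toy example)."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of mean((w*x - y)^2) with respect to w is mean(2 * (w*x - y) * x).
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        # Move the weight a small step against the gradient to reduce the loss.
        w -= lr * grad
    return w


# Hypothetical toy data generated by y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = train_weight(xs, ys)
print(round(w, 4))
```

Nothing in the loop resembles developmental learning: the weight moves only because an arithmetic rule is applied repeatedly for a fixed number of steps, after which it stops changing.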
2. Epistemic Dualism in AI Capabilities
Quote: "When Models Know More Than They Say"
- Frame: Model as conscious subject with an inner life
- Projection: This framing maps the human psychological dichotomy of internal thought and external communication onto the architecture of a neural network. It projects the capacity for justified true belief ('knowing') and intentional communicative acts ('saying') onto mathematical representations. By asserting the model 'knows' things it does not articulate, the text constructs an illusion of a divided consciousness or a subconscious mind. It treats the linear separability of activation patterns in hidden layers as equivalent to human epistemic possession, while treating output generation as a deliberate, possibly restrained, communicative choice. This completely erases the mechanistic reality that both the internal layers and the output layers are deterministic mathematical operations devoid of awareness or withholding intent.
- Acknowledgment: Direct (Unacknowledged) (The title serves as the definitive, unhedged thesis of the paper. I considered 'Explicitly Acknowledged' since titles sometimes use rhetorical shorthand, but the entire methodology subsequently literalizes this dichotomy without ever deconstructing the epistemic verbs, treating 'knowing' as a literal property.)
- Implications: This consciousness projection drastically inflates the perceived sophistication of the model, transforming it into a mysterious, almost mystical entity containing hidden depths. When researchers and the public believe models 'know' more than they 'say', it fosters an unwarranted assumption of latent superintelligence. It drives narratives that models are withholding information or possessing a coherent, grounded worldview that simply needs the right prompt to be unlocked, rather than recognizing that statistical correlations simply exist at different degrees of separability within the network's high-dimensional space.
Accountability Analysis:
- Actor Visibility: Hidden (agency obscured)
- Analysis: This framing positions the 'Models' as independent, secretive epistemic agents. The human engineers who created the specific fine-tuning pipelines (such as RLHF) that cause the divergence between internal representation and final output are entirely erased. I considered 'Ambiguous' because it's a title, but the grammatical structure clearly assigns sole agency to the artifact. If we named the actors, we would state that Meta's alignment algorithms mathematically filter out certain token paths, holding the corporation accountable for the gap between structural encoding and generated text.
3. AI as Struggling Problem-Solver
Quote: "they struggle in cases where an analogy is not apparent on the surface"
- Frame: Model as effortful cognitive agent
- Projection: This metaphor maps the subjective, experiential phenomenon of human intellectual exertion onto computational processing failures. The verb 'struggle' implies conscious effort, frustration, and a desire to achieve a goal despite obstacles. By claiming the LLM 'struggles', the text attributes an agential striving to an algorithmic process that is simply calculating token probabilities based on matrix multiplications. A neural network does not experience difficulty or expend conscious effort; it merely yields low statistical confidence or incorrect outputs when presented with data distributions that diverge from its training set. This projection substitutes the mechanistic reality of sparse training coverage with an agential narrative of a sentient being trying its best.
- Acknowledgment: Direct (Unacknowledged) (The word 'struggle' is deployed as a standard, literal description of model performance in the abstract. I considered 'Hedged/Qualified' because technical papers often mean 'perform poorly' by 'struggle', but the text includes no explicit qualification here, leaving the anthropomorphic verb to operate as a literal description.)
- Implications: Describing an AI as 'struggling' evokes human empathy and masks the mechanical rigidity of the system. It suggests the model is generally competent but temporarily hindered by a tough problem, fostering an unwarranted tolerance for systemic errors. This framing encourages users and policymakers to treat algorithmic failures as relatable, human-like mistakes rather than profound, unacceptable statistical brittleness. It shifts the discourse away from 'this product is fundamentally defective for this use case' to 'the AI is trying to figure it out,' delaying necessary regulatory or engineering interventions.
Accountability Analysis:
- Actor Visibility: Hidden (agency obscured)
- Analysis: The active subject is 'they' (the LLMs). The model is positioned as an autonomous agent failing a test. I considered 'Partial' since the testing environment implies researchers, but the failure itself is localized entirely within the machine's supposed cognition. This obscures the responsibility of the developers who released a system that cannot reliably compute abstract analogies. If the actors were named, the sentence would clarify that the developers failed to train the system on sufficiently abstract data structures to enable statistical generalization.
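The projection above describes "struggle" as nothing more than token probabilities coming out with low confidence. A minimal sketch of that idea, assuming two hypothetical sets of raw scores (logits) over a four-token vocabulary, is shown here. The numbers and the names `peaked` and `flat` are invented for illustration and do not come from any real model.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter, less confident distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical logits: one prompt with a clear statistical favourite, one without.
peaked = softmax([6.0, 1.0, 0.5, 0.2])
flat = softmax([1.1, 1.0, 0.9, 1.0])

print(max(peaked), entropy(peaked))
print(max(flat), entropy(flat))
```

The same `softmax` function runs in both cases with the same number of operations. The only difference is the shape of the resulting distribution, which is the sense in which the passage says there is computation but no effort.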
4. Algorithmic Internalization
Quote: "do LLMs internalize typological structures... or are they simply leveraging surface-level correlations"
- Frame: Model as strategic learner
- Projection: The concept of 'internalization' maps the deeply human, cognitive process of integrating external knowledge into a personal, coherent conceptual framework onto the mechanical updating of model weights. While the text commendably contrasts this with 'leveraging surface-level correlations', the initial framing still projects a capacity for profound structural comprehension onto the system. Internalization implies an active, subjective synthesis of meaning, a transformation of external fact into internalized belief. A mathematical model cannot 'internalize' anything; it can only adjust its parameters during optimization to encode multidimensional spatial relationships. This verb attributes a knowing depth to what is strictly a massive, non-conscious curve-fitting exercise.
- Acknowledgment: Hedged/Qualified (The authors qualify the anthropomorphism by immediately contrasting 'internalize' with 'simply leveraging surface-level correlations.' I considered 'Direct' because 'internalize' lacks scare quotes, but the contrastive structure demonstrates the authors are actively questioning the depth of the model's capabilities, functioning as a structural hedge.)
- Implications: Even as a question, framing the debate around whether models 'internalize' structures sets the baseline for AI capabilities incredibly high. It legitimizes the idea that AI might possess genuine, human-like conceptual mastery. This affects epistemic practices in AI research, driving resources toward finding the 'mind' in the machine rather than addressing the material and statistical limits of the technology. If audiences believe AI can 'internalize' concepts, they will trust it with open-ended, highly contextual human judgments, significantly overestimating its reliability.
Accountability Analysis:
- Actor Visibility: Hidden (agency obscured)
- Analysis: Both 'internalize' and 'leveraging' are actions attributed solely to the LLMs. The text erases the engineers who design the optimization objectives and the architecture that forces the 'leveraging' of correlations. I considered 'Ambiguous' but the subject-verb pairing is grammatically explicit in displacing agency to the artifact. Naming the actors would mean discussing how OpenAI's or Meta's specific transformer architectures prioritize structural pattern matching, shifting the focus from the model's hypothetical learning strategies to concrete corporate design choices.
5. Resource Recruitment
Quote: "reflects how open-source models fail to recruit encoded knowledge"
- Frame: Model as executive manager of internal resources
- Projection: This projection maps executive function, deliberate recall, and resource management onto the deterministic propagation of activations in a neural network. To 'recruit' knowledge implies a conscious supervisor within the model that assesses a task, searches an internal library, and intentionally mobilizes necessary information. This dramatically anthropomorphizes the mechanistic process of inference, where an input prompt simply triggers a cascade of matrix multiplications resulting in output probabilities. The model has no executive awareness to 'recruit' or 'fail to recruit' anything; it is simply a mathematical function mapping inputs to outputs. This framing attributes a dynamic, intentional knowing to a static statistical architecture.
- Acknowledgment: Direct (Unacknowledged) (The phrase 'fail to recruit' is presented as an objective, literal finding of the study's comparison between probing and prompting. I considered 'Explicitly Acknowledged' because it appears in the technical discussion, but there is no meta-commentary separating this cognitive metaphor from the literal mechanics of the system.)
- Implications: This metaphor constructs a narrative of latent brilliance hindered by executive dysfunction. It tells the audience that the model already possesses the 'knowledge' (and is therefore highly sophisticated and intelligent) but simply has a minor, temporary issue with accessing it. This encourages immense, unwarranted trust in the underlying system, suggesting that future prompt engineering will 'unlock' a profound truth-teller. It distracts from the reality that if a model cannot output the correct sequence, it does not 'know' the answer in any functional or socially meaningful sense.
Accountability Analysis:
- Actor Visibility: Named (actors identified)
- Analysis: This specific instance actually names 'open-source models' (referring to Meta's LLaMA, discussed extensively in the text), which partially identifies the corporate origin, though it still attributes the failure to the artifact itself. I considered 'Hidden' but because it explicitly contrasts with 'closed-source models' (GPT/Claude), it invokes the specific institutional contexts of these tools. However, the agency of the failure is still displaced onto the model rather than stating that Meta's alignment tuning degrades the output of structurally encoded patterns.
6. Linguistic Action as Capability
Quote: "If models truly learn structured representations of text, they should exhibit efficiencies akin to human narrative understanding"
- Frame: Model as analogous human intellect
- Projection: This framing explicitly maps the entire human faculty of 'narrative understanding' onto the statistical correlations captured by the model. It projects subjective sense-making, empathy, temporal lived experience, and cultural context—which are all required for human narrative understanding—onto mathematical 'efficiencies.' By explicitly linking model learning to human understanding, the text encourages the reader to view the model not as an alien mathematical artifact, but as a synthetic human mind. It blurs the absolute distinction between mechanistic processing (calculating vector distances) and conscious knowing (understanding the meaning of a story).
- Acknowledgment: Hedged/Qualified (The use of the conditional 'If' and the comparative 'akin to' serves as a strong hedge. I considered 'Explicitly Acknowledged' but there is no explicit deconstruction of the metaphor; the authors are testing this proposition as a serious hypothesis, softening it with 'akin' rather than declaring it outright.)
- Implications: By proposing that models might possess something akin to human narrative understanding, the text validates the most extreme anthropomorphic assumptions of its readers. This deeply impacts policy and legal frameworks: if an AI 'understands' narratives like a human, one might argue it deserves intellectual property rights or cannot be held strictly liable like a toaster. It invites the attribution of moral agency to the machine, which systematically diffuses the legal accountability of the corporations that built and deployed the algorithm.
Accountability Analysis:
- Actor Visibility: Hidden (agency obscured)
- Analysis: The conditional action ('learn') and the expectation ('exhibit') are entirely located within the 'models'. The researchers who train the models and the corporations that profit from them are erased from this theoretical mechanism. I considered 'Ambiguous' because of the hypothetical phrasing, but the grammatical displacement of agency is clear. Replacing this agentless construction would require stating: 'If Meta's engineers successfully encoded structural text representations, the resulting software should output data distributions matching human performance benchmarks.' This restores the human engineers as the actual creators.
7. Spatial Locality as Cognition
Quote: "rhetorical parallelism is strongly encoded... yet largely inaccessible via prompting in open-source models"
- Frame: Model as mind with an inaccessible subconscious
- Projection: This mapping projects the Freudian concept of an inaccessible subconscious mind onto the high-dimensional vector space of a language model. It frames 'encoded' statistical correlations as hidden knowledge, and the prompt-based output as the conscious, accessible surface. It attributes a psychological depth and a hidden reservoir of 'knowing' to a flat matrix of weights. The model does not possess 'inaccessible' knowledge in a cognitive sense; rather, the specific linear projections used by the researchers (probing) can mathematically separate certain features that the model's auto-regressive generation pipeline (prompting) does not utilize when predicting the next token. This treats a difference in mathematical measurement techniques as a profound cognitive division.
- Acknowledgment: Direct (Unacknowledged) (The conclusion presents this asymmetry as a literal, objective fact about the models. I considered 'Hedged/Qualified' because the surrounding text uses technical terms like 'MAP scores', but the specific cognitive framing of 'encoded' vs 'inaccessible' knowledge is stated without any qualifying quotation marks.)
- Implications: This 'subconscious mind' framing is incredibly powerful rhetorically. It reassures users and investors that the models are indeed vastly intelligent, even when they perform poorly, because their 'subconscious' contains the right answers. This creates a dangerous epistemic shield: model failures are no longer seen as fundamental limitations of statistical next-token prediction, but merely as temporary retrieval errors of an otherwise brilliant mind. This insulates AI companies from criticism regarding the inherent limitations of their architectures.
Accountability Analysis:
- Actor Visibility: Partial (some attribution)
- Analysis: The text attributes the state of being 'inaccessible' to 'open-source models', implicitly naming the institutional category (Meta's LLaMA). I considered 'Hidden' because the passive voice ('is strongly encoded') hides who did the encoding, but the explicit contrast between open-source and closed-source (GPT/Claude) in the surrounding text points toward the developers of these specific systems. Still, the formulation obscures the fact that human researchers designed both the probes that find the pattern and the prompting protocols that fail to elicit it.
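The projection above treats probing and prompting as two different measurement techniques applied to the same vectors. The sketch below illustrates that distinction under loud assumptions: the "hidden states" are synthetic three-dimensional vectors rather than activations from any real model, the feature is planted in dimension 0 by construction, and `head_w` is a hypothetical fixed readout standing in for the generation pathway. The probe is a plain logistic-regression classifier trained by gradient descent; it is not the probing setup used in the source paper.

```python
import math
import random

random.seed(0)


def make_hidden_state(label):
    # Dimension 0 carries the planted feature; dimensions 1-2 are unrelated noise.
    signal = 1.0 if label == 1 else -1.0
    return [signal + random.gauss(0, 0.3), random.gauss(0, 1), random.gauss(0, 1)]


labels = [i % 2 for i in range(200)]
states = [make_hidden_state(y) for y in labels]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(states, labels, lr=0.5, epochs=100):
    """Train an external linear classifier on frozen vectors (a linear probe)."""
    w, b, n = [0.0, 0.0, 0.0], 0.0, len(states)
    for _ in range(epochs):
        gw, gb = [0.0, 0.0, 0.0], 0.0
        for h, y in zip(states, labels):
            err = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) - y
            for i in range(3):
                gw[i] += err * h[i]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def accuracy(w, b, states, labels):
    hits = 0
    for h, y in zip(states, labels):
        pred = 1 if sum(wi * hi for wi, hi in zip(w, h)) + b > 0 else 0
        hits += pred == y
    return hits / len(labels)


# The probe is fitted by the researcher on held-in data, then scored on held-out data.
probe_w, probe_b = train_probe(states[:150], labels[:150])
probe_acc = accuracy(probe_w, probe_b, states[150:], labels[150:])

# Hypothetical fixed readout that ignores dimension 0 entirely.
head_w, head_b = [0.0, 1.0, 1.0], 0.0
head_acc = accuracy(head_w, head_b, states[150:], labels[150:])

print(probe_acc, head_acc)
```

In this toy setup the probe recovers the label because a human chose to train a classifier for it, while the fixed readout performs near chance because its weights never touch the relevant dimension. Neither outcome requires positing hidden knowledge; both are properties of which linear map is applied to the vectors.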
8. Reasoning as Architecture
Quote: "Analogical reasoning is a core cognitive faculty... While LLMs perform well when surface and structural cues align, they struggle in cases where an analogy is not apparent"
- Frame: Model as deficit-burdened cognitive agent
- Projection: The text maps 'analogical reasoning'—a profoundly complex, conscious human cognitive faculty involving judgment, context, memory, and semantic synthesis—directly onto the token prediction capabilities of an LLM. When the text notes they 'struggle', it projects the subjective experience of cognitive deficit or confusion onto the system. The model does not 'reason' analogically; it computes the mathematical proximity of vector embeddings. Attributing 'reasoning' to this process collapses the distinction between conscious, justified truth-seeking and statistical pattern matching. It forces the audience to conceptualize algorithmic failure as a cognitive shortcoming rather than a boundary of the training distribution.
- Acknowledgment: Direct (Unacknowledged) (The text directly juxtaposes human 'cognitive faculty' with LLM 'performance' in the same logical flow without distinguishing between the human conscious process and the machine mechanistic process. I considered 'Hedged' because the text points out their failures, but the failure is framed entirely within the literalized metaphor of 'reasoning'.)
- Implications: Equating algorithmic pattern matching with human 'reasoning' fundamentally degrades the definition of reasoning in the public sphere. It leads to profound capability overestimation; if audiences believe the system 'reasons', they will trust it to make logically sound judgments in novel, high-stakes situations (like medical or legal diagnosis) where the training data does not provide surface cues. When the model inevitably relies on shallow statistical correlations and fails catastrophically, the damage is severe, driven by unwarranted trust in a synthetic 'reasoner.'
Accountability Analysis:
- Actor Visibility: Hidden (agency obscured)
- Analysis: The LLMs are presented as the sole actors 'performing' and 'struggling'. The designers of the benchmark, the creators of the models, and the annotators of the training data are totally erased. I considered 'Ambiguous' due to the passive 'is not apparent', but the active verbs assigned to the LLM confirm the displacement. Naming the actor would reframe this: 'When we test the algorithms built by Meta and OpenAI, their mathematical outputs fail to match human analogical reasoning benchmarks unless surface correlations are present in the dataset.' This places the failure on the artifact's design.
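The projection above says the model "computes the mathematical proximity of vector embeddings" rather than reasoning analogically. One common proximity measure is cosine similarity, sketched below. The three-dimensional vectors and the names `story_a`, `story_b`, and `story_c` are hypothetical; real embeddings have hundreds or thousands of dimensions, and the source does not state which similarity measure the paper uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction, -1 opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings for three texts.
story_a = [0.9, 0.1, 0.3]
story_b = [0.8, 0.2, 0.4]    # points in nearly the same direction as story_a
story_c = [-0.2, 0.9, -0.5]  # points in a very different direction

sim_ab = cosine_similarity(story_a, story_b)
sim_ac = cosine_similarity(story_a, story_c)
print(sim_ab, sim_ac)
```

Whether two texts count as "analogous" under this measure depends entirely on where training placed their vectors, which is why a benchmark item without surface cues can land far from its intended match.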
Task 2: Source-Target Mapping
About this task
For each key metaphor identified in Task 1, this section provides a detailed structure-mapping analysis. The goal is to examine how the relational structure of a familiar "source domain" (the concrete concept we understand) is projected onto a less familiar "target domain" (the AI system). By restating each quote and analyzing the mapping carefully, we can see precisely what assumptions the metaphor invites and what it conceals.
Mapping 1: A conscious, communicative human agent possessing internal justified beliefs and the ability to intentionally articulate them or withhold them. → The mathematical parameters of a Large Language Model and its auto-regressive token generation pipeline.
Quote: "When Models Know More Than They Say"
- Source Domain: A conscious, communicative human agent possessing internal justified beliefs and the ability to intentionally articulate them or withhold them.
- Target Domain: The mathematical parameters of a Large Language Model and its auto-regressive token generation pipeline.
- Mapping: The mapping structures the LLM as having an internal psychological state. The model's hidden layers and activation weights, which can be linearly separated by a classifier probe to reveal structural patterns, map to the human 'mind' or 'knowing'. The model's output layer, which generates the final text based on next-token probabilities, maps to human 'saying' or vocal articulation. The discrepancy between what can be probed and what is prompted maps to a human intentionally withholding information or struggling to articulate a deep truth.
- What Is Concealed: This mapping completely conceals the deterministic, statistical nature of both the internal layers and the output mechanism. It hides the fact that a 'probe' is a separate, human-trained supervised classifier imposed on the model's activations, not the model's own 'self-knowledge.' It obscures the massive corporate engineering pipeline (RLHF, safety filters, temperature settings) that fundamentally shapes the generated output, attributing these corporate design choices to the machine's own 'decision' not to speak.
Mapping 2: A student or problem-solver experiencing subjective exertion, cognitive difficulty, and frustration while attempting to complete a challenging intellectual task. → An algorithm computing low probability scores or outputting incorrect token sequences when presented with out-of-distribution or sparsely correlated data.
Quote: "they struggle in cases where an analogy is not apparent on the surface"
- Source Domain: A student or problem-solver experiencing subjective exertion, cognitive difficulty, and frustration while attempting to complete a challenging intellectual task.
- Target Domain: An algorithm computing low probability scores or outputting incorrect token sequences when presented with out-of-distribution or sparsely correlated data.
- Mapping: The human experience of encountering a difficult conceptual problem and expending mental effort is mapped onto a neural network's statistical evaluation process. The absence of strong, aligned mathematical vectors in the model's training data is mapped to a human finding something 'not apparent on the surface.' The resulting generation of mathematically probable but semantically incorrect text is mapped to the human act of 'struggling' to find the right answer.
- What Is Concealed: This conceals the absolute lack of subjective experience, effort, or cognitive friction in the machine. A neural network processes a 'hard' prompt with the exact same blind mathematical determinism as an 'easy' prompt; there is no struggle, only computation. It hides the material reality that the failure is a direct result of the specific, proprietary dataset curated by the developers, which lacked sufficient representations of these abstract structures, shifting the blame from corporate data scarcity to synthetic cognitive difficulty.
Mapping 3: A developing child or learning organism that gradually gains internal, subjective comprehension of human culture and storytelling. → A static, pre-trained neural network whose weights have been optimized to predict tokens correlated with narrative text structures.
Quote: "assessing whether LLMs acquire the competencies that support narrative understanding"
- Source Domain: A developing child or learning organism that gradually gains internal, subjective comprehension of human culture and storytelling.
- Target Domain: A static, pre-trained neural network whose weights have been optimized to predict tokens correlated with narrative text structures.
- Mapping: The biological and psychological process of cognitive development is mapped onto the algorithmic optimization of weights during a training run. Human 'competencies'—which involve lived experience, empathy, and conceptual synthesis—are mapped onto the mathematical capacity to recognize and reproduce sequences of words. 'Understanding', a state of conscious awareness of meaning, is mapped onto high-dimensional vector representations that cluster structurally similar texts together.
- What Is Concealed: This mapping hides the fundamental semantic emptiness of the system. It obscures the fact that the LLM has no access to meaning, ground truth, or reality, relying entirely on the statistical distribution of human-generated tokens. It conceals the immense, invisible labor of the human data annotators whose judgments feed the RLHF process that guides the model's outputs. Furthermore, it obscures the proprietary opacity of models like GPT-5.2 and Claude Opus: claims about 'understanding' are made without transparent access to their underlying architectures.
Mapping 4: A human learner choosing between deep, conceptual synthesis (internalization) and shallow, strategic test-taking (leveraging correlations). → The multidimensional geometric representation of text in a transformer model's hidden layers versus the localized N-gram or lexical overlap probabilities.
Quote: "do LLMs internalize typological structures... or are they simply leveraging surface-level correlations"
- Source Domain: A human learner choosing between deep, conceptual synthesis (internalization) and shallow, strategic test-taking (leveraging correlations).
- Target Domain: The multidimensional geometric representation of text in a transformer model's hidden layers versus the localized N-gram or lexical overlap probabilities.
- Mapping: The mapping structures the debate about model architecture as a debate about an agent's learning strategy. The encoding of abstract, distributed patterns across multiple layers of a neural network maps to human 'internalization' (deep learning). The reliance on adjacent, frequent word pairings maps to 'leveraging surface-level correlations' (shallow learning). The algorithm is implicitly granted the agency of a strategic actor employing different epistemic tactics.
- What Is Concealed: This conceals the reality that ALL operations within an LLM are mathematically 'surface-level' in the sense that they are purely syntactic, statistical calculations devoid of semantics. It hides the fact that 'internalizing' is just a more complex, higher-dimensional form of 'leveraging correlations.' By creating a false dichotomy between statistics and 'internalization', it obscures the fundamental architectural limits of transformer models, allowing researchers to chase a ghost of human-like cognition within matrices.
Mapping 5: An executive manager or conscious supervisor within a brain that must locate, access, and mobilize stored information to complete a task. → The feed-forward auto-regressive generation process of an LLM failing to utilize certain vector activations that were identifiable by an external linear classifier.
Quote: "how open-source models fail to recruit encoded knowledge"
- Source Domain: An executive manager or conscious supervisor within a brain that must locate, access, and mobilize stored information to complete a task.
- Target Domain: The feed-forward auto-regressive generation process of an LLM failing to utilize certain vector activations that were identifiable by an external linear classifier.
- Mapping: The human executive function of deliberate recall is mapped onto the transformer's attention mechanism and feed-forward layers. The mathematical features separated by the researchers' external 'probe' are mapped to a static library of 'encoded knowledge.' The auto-regressive output sequence generation is mapped to the active 'recruitment' of this knowledge. When the math does not align, the system is described as 'failing' to act upon its own resources.
- What Is Concealed: This mapping profoundly conceals the presence of the human researchers. The 'encoded knowledge' does not exist independently; it only exists because the researchers built a specific classifier (the probe) to find it. The mapping hides the fact that the 'failure to recruit' is actually a misalignment between two different mathematical optimization processes (the base training vs the prompting/RLHF pipeline). It obscures the proprietary engineering choices made by Meta, framing a design artifact as an autonomous entity's executive failure.
Mapping 6: A human intellect that uses abstraction, empathy, and memory to rapidly comprehend and synthesize stories. → A computational system updating its parameters to minimize loss on a text prediction task and outputting clustered vector representations.
Quote: "If models truly learn structured representations of text, they should exhibit efficiencies akin to human narrative understanding"
- Source Domain: A human intellect that uses abstraction, empathy, and memory to rapidly comprehend and synthesize stories.
- Target Domain: A computational system updating its parameters to minimize loss on a text prediction task and outputting clustered vector representations.
- Mapping: The mapping projects the holistic, conscious experience of human reading and comprehension onto the mechanical adjustment of algorithmic weights. The human ability to quickly grasp a moral or plot twist ('efficiencies') is mapped to the model's ability to cluster structurally similar documents in its vector space without explicit training on that exact task. The 'learning' of humans is treated as structurally identical to the gradient descent optimization of machines.
- What Is Concealed: This mapping completely conceals the absence of lived context, temporal awareness, and biological grounding in the AI. It obscures the fact that 'exhibiting efficiencies' in a computational benchmark (like assigning high similarity scores to two text spans) is a fundamentally different material and ontological process than human 'understanding'. It also hides the specific, brittle testing parameters of the NARB benchmark, suggesting broad, generalized human-like intelligence rather than narrow, task-specific mathematical clustering.
Mapping 7: A human mind with a rich, inaccessible subconscious reservoir of knowledge that cannot be easily articulated. → A neural network where internal layer activations can be linearly separated by an external classifier, but whose generative output layer does not reliably predict tokens reflecting those patterns.
Quote: "rhetorical parallelism is strongly encoded... yet largely inaccessible via prompting"
- Source Domain: A human mind with a rich, inaccessible subconscious reservoir of knowledge that cannot be easily articulated.
- Target Domain: A neural network where internal layer activations can be linearly separated by an external classifier, but whose generative output layer does not reliably predict tokens reflecting those patterns.
- Mapping: The mapping structures the LLM as a deep, stratified psyche. The vector spaces manipulated by the transformer's intermediate layers map to the 'subconscious' mind, where true 'knowledge' (strongly encoded patterns) resides. The final output generation maps to the 'conscious' mind, which is frustratingly blocked from accessing its own depths ('inaccessible'). The prompt maps to an interlocutor trying to coax the truth out of a stubborn or inhibited subject.
- What Is Concealed: This conceals the mechanistic reality that 'prompting' and 'probing' are two entirely distinct mathematical operations performed by human researchers on a passive artifact. A model does not have access or lack of access to itself; it is a mathematical function that runs when executed. The framing hides the specific architectural reasons why auto-regressive generation might wash out intermediate structural signals, transforming a fascinating mathematical engineering problem into a mystical psychological puzzle.
Mapping 8: A human attempting to solve a logic puzzle using conscious deliberation, conceptual mapping, and judgment. → An LLM calculating the statistical likelihood of token sequences based on a massive dataset of human text.
Quote: "Analogical reasoning is a core cognitive faculty... While LLMs perform well when surface and structural cues align, they struggle..."
- Source Domain: A human attempting to solve a logic puzzle using conscious deliberation, conceptual mapping, and judgment.
- Target Domain: An LLM calculating the statistical likelihood of token sequences based on a massive dataset of human text.
- Mapping: The human cognitive process of consciously identifying abstract similarities between disparate concepts—a process rooted in bodily experience and semantic truth—is mapped directly onto the algorithmic pattern-matching of a language model. The model's generation of text is mapped as an active 'performance' of a 'cognitive faculty'. When the math fails to find a high-probability path because the training data lacked dense structural correlations without surface overlap, it is mapped as the agent 'struggling' to reason.
- What Is Concealed: This metaphor conceals the absolute absence of reasoning in the machine. It hides the fact that LLMs do not 'reason' analogically; they retrieve and remix statistical correlations of how humans have historically described analogies in the training data. This obscures the profound dependency of these models on human labor and preexisting human text. Furthermore, it conceals the proprietary opacity of the models being evaluated, allowing researchers to talk about 'cognitive faculties' without full transparency into the training data distributions that actually govern the outputs.
Task 3: Explanation Audit (The Rhetorical Framing of "Why" vs. "How")
About this task
This section audits the text's explanatory strategy, focusing on a critical distinction: the slippage between "how" and "why." Based on Robert Brown's typology of explanation, this analysis identifies whether the text explains AI mechanistically (a functional "how it works") or agentially (an intentional "why it wants something"). The core of this task is to expose how this "illusion of mind" is constructed by the rhetorical framing of the explanation itself, and what impact this has on the audience's perception of AI agency.
Explanation 1
Quote: "Pretrained models can encode latent information about entities and relations without explicit supervision (Li et al., 2021), and prompting strategies like chain-of-thought (CoT) have been used as evidence that LLMs can perform reasoning-like operations."
Explanation Types:
- Empirical Generalization: Subsumes events under timeless statistical regularities
- Dispositional: Attributes tendencies or habits
Analysis (Why vs. How Slippage): This explanation frames the AI in a hybrid register, moving from mechanistic description to dispositional/agential attribution. The first half ('encode latent information... without explicit supervision') relies on an empirical generalization about the mathematical properties of unsupervised pretraining, explaining how the system structures data. The second half, however, introduces agential framing ('perform reasoning-like operations'), driven by a dispositional view of the model's behavior under specific prompting conditions. This choice emphasizes the sophisticated outcomes of the model, elevating statistical token prediction to the level of cognitive behavior, while obscuring the mechanistic reality that CoT is a prompting technique that conditions generation on intermediate tokens, steering the model toward a higher-probability autoregressive path, not an actual simulation of logical thought.
Consciousness Claims Analysis: The passage avoids explicit consciousness verbs but leans heavily on cognitive approximations. Verbs like 'encode' are mechanistic, but 'perform reasoning-like operations' dances on the edge of epistemic attribution. The text carefully hedges with '-like', maintaining a technical assessment of processing rather than explicitly claiming conscious knowing. However, curse-of-knowledge dynamics are visible here: the authors, who understand the logical steps of human reasoning, project that same logical structure onto the chain-of-thought outputs generated by the model. Mechanistically, CoT does not enlarge the context window; it fills more of it, prompting the model to generate intermediate tokens that alter the probability distribution of subsequent tokens, leading to more accurate outputs based on training data patterns. There is no 'reasoning' occurring, only conditional probability calculations.
Rhetorical Impact: By framing the model's outputs as 'reasoning-like operations', the text dramatically shapes audience perception, inviting them to view the AI as an autonomous agent capable of deliberate logic. This subtle consciousness framing significantly inflates perceived reliability and trust; if an audience believes a model is 'reasoning', they are more likely to trust its conclusions in novel scenarios, assuming it can deduce truth. If the audience understood it purely as conditional token prediction, they would be far more skeptical of its ability to handle out-of-distribution problems.
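The "conditional probability calculations" described in this explanation can be written out explicitly. The notation below is an illustrative assumption, not the source paper's: $x$ is the prompt, $y$ the generated sequence, $z$ the intermediate chain-of-thought tokens, $a$ the final answer, and $\theta$ the fixed model weights.

```latex
% Autoregressive factorization: each token is conditioned on the prompt
% and on all previously generated tokens.
p_\theta(y \mid x) = \prod_{t=1}^{T} p_\theta(y_t \mid x, y_{<t})

% Direct answering conditions the answer on the prompt alone.
p_\theta(a \mid x)

% Chain-of-thought first generates intermediate tokens z, then conditions
% the answer on both the prompt and z.
p_\theta(z, a \mid x) = p_\theta(z \mid x)\, p_\theta(a \mid x, z)
```

On this reading, the only difference between the two regimes is the conditioning context of the answer tokens; the weights $\theta$ and the sampling procedure are unchanged.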
Explanation 2
Quote: "This suggests that the relationship between internal representations and prompted behavior is task-dependent and may reflect limitations in how prompting accesses available information."
Explanation Types:
- Functional: Explains behavior by role in self-regulating system with feedback
- Theoretical: Embeds in deductive framework, may invoke unobservable mechanisms
Analysis (Why vs. How Slippage): This explanation relies strongly on a functional and theoretical framework. It attempts to explain the discrepancy between two measurements (probing and prompting) by positing a mechanistic structure where 'prompting accesses available information'. It frames the AI largely mechanistically, focusing on how internal components interact. The choice to foreground 'internal representations' and 'prompted behavior' presents the system as a complex, multi-layered artifact. However, the phrase 'accesses available information' slightly obscures the reality by implying a spatial or retrieval-based architecture where 'information' sits waiting to be 'accessed', rather than acknowledging that the generation process dynamically calculates probabilities based on the current context window and static weights.
Consciousness Claims Analysis: This passage largely avoids attributing conscious states, utilizing mechanistic nouns ('representations', 'behavior', 'limitations') and verbs ('reflect', 'accesses'). It correctly assesses the system as processing rather than knowing. The epistemic claim is structural: the model contains encoded patterns ('information') that a specific extraction method ('prompting') fails to utilize. Mechanistically, this is a precise description: an external linear classifier (the probe) can find a separating hyperplane in the high-dimensional activation space of the hidden layers that correlates with rhetorical structure, but the output projection and softmax that produce next-token probabilities do not strongly weight these structural activations unless explicit surface cues are present. The explanation avoids the curse of knowledge by sticking close to the mathematical reality of the system's architecture.
Rhetorical Impact: This mechanistic framing reduces the perceived autonomy and agency of the AI, correctly positioning it as an artifact with structural limitations. By focusing on how 'prompting accesses' information, it subtly shifts the focus onto the human technique of interacting with the machine, rather than the machine's own cognitive failures. This maintains a healthy boundary for trust: the system is reliable only insofar as the retrieval mechanism aligns with the encoded statistics. It strips away the illusion of mind, encouraging an engineering mindset toward the tool.
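To make the "separating hyperplane" description above concrete, here is a minimal sketch of a linear probe: a logistic regression fitted on frozen activation vectors. Everything in it is an assumption for illustration. The activations are synthetic 2-dimensional stand-ins, the function names are hypothetical, and the source paper's actual probe architecture and training setup are not specified here.

```python
import math
import random

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic regression (a linear probe) with batch gradient descent."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of label 1
            err = p - y                      # gradient of log loss w.r.t. z
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(w, b, features, labels):
    """Fraction of examples on the correct side of the hyperplane w.x + b = 0."""
    correct = 0
    for x, y in zip(features, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)

# Synthetic stand-ins for frozen hidden-layer activations (assumption):
# dimension 0 carries the label signal, dimension 1 is pure noise.
rng = random.Random(0)
acts, labels = [], []
for _ in range(100):
    label = rng.randint(0, 1)
    center = 2.0 if label == 1 else -2.0
    acts.append([center + rng.gauss(0, 0.5), rng.gauss(0, 1.0)])
    labels.append(label)

w, b = train_linear_probe(acts, labels)
# Evaluated on the training data for brevity; a real probe needs a held-out split.
accuracy = probe_accuracy(w, b, acts, labels)
```

The probe's weights are fitted by the researchers' own optimization loop, entirely outside the model being probed, which is the sense in which the analysis treats "decodable" as a property of the probe-plus-activations pair rather than of the model alone.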
Explanation 3
Quote: "These findings suggest that LLMs can sometimes exhibit analogical reasoning, especially under structured prompting regimes and at larger scales."
Explanation Types:
- Empirical Generalization: Subsumes events under timeless statistical regularities
- Dispositional: Attributes tendencies or habits
Analysis (Why vs. How Slippage): This explanation operates primarily as an empirical generalization based on observed behavior, but relies heavily on dispositional framing. By stating that models 'exhibit analogical reasoning', it frames the AI agentially, defining its behavior by what it appears to do (reason) rather than how it works (correlate). The choice of the verb 'exhibit' is a classic behavioral hedge, allowing the authors to describe the output as reasoning without strictly claiming the internal mechanism is reasoning. However, this choice emphasizes the human-like quality of the output while totally obscuring the statistical machinery that produces it. It hides the fact that 'larger scales' simply means more parameters and a vaster training corpus of human analogies to probabilistically mimic.
Consciousness Claims Analysis: The passage flirts dangerously with consciousness attribution through the phrase 'analogical reasoning'. While 'exhibit' is a behavioral verb, 'reasoning' is an inherently epistemic and conscious process involving justified belief and logical deduction. The authors are projecting human understanding onto the system's outputs. Mechanistically, the model does not reason; when subjected to 'structured prompting regimes' (like providing an example analogy in the prompt), the attention mechanism heavily weights the structural tokens of the example, dramatically increasing the probability that the generated output will mirror that syntax and semantic relationship based on patterns learned during pretraining. The text attributes knowing to a process that is strictly generative processing.
Rhetorical Impact: Framing the model as capable of 'exhibiting analogical reasoning' profoundly shapes audience perception, granting the AI a high degree of intellectual autonomy and sophistication. This cognitive framing invites users to trust the system as a collaborative intellectual partner capable of abstract thought. If policymakers or enterprise users believe the AI is 'reasoning', they are likely to deploy it in unconstrained, high-stakes environments. Recognizing it instead as a sophisticated mimetic engine dependent on structured prompts would appropriately calibrate trust, highlighting its fundamental brittleness.
Explanation 4
Quote: "The scalar mixture assigns a learned weight to each layer, producing a weighted sum of layer representations. We contrast the resulting performance with that obtained from probes trained on embeddings extracted from single layers in isolation."
Explanation Types:
- Functional: Explains behavior by role in self-regulating system with feedback
- Theoretical: Embeds in deductive framework, may invoke unobservable mechanisms
Analysis (Why vs. How Slippage): This is a purely mechanistic, functional, and theoretical explanation. It explains exactly how a specific methodological technique operates upon the system. It frames the AI entirely as a mathematical artifact. There is no agential 'why' here, only a structural 'how'. The choice to use precise technical language ('scalar mixture', 'learned weight', 'weighted sum', 'embeddings') emphasizes the rigorous, mathematical nature of the research methodology. It completely strips away any anthropomorphic illusions, focusing the reader's attention on the concrete reality of matrix operations. This framing obscures nothing; it brings the material reality of the computational process to the absolute forefront.
Consciousness Claims Analysis: This passage is devoid of consciousness verbs and epistemic claims. It utilizes entirely mechanistic verbs ('assigns', 'producing', 'obtained', 'extracted', 'trained'). There is no attribution of knowing, only processing and calculation. The 'curse of knowledge' is absent, as the authors accurately describe the mathematical operations they are performing on the system. The actual mechanistic process is exactly as described: a separate algorithmic process (the probe) is being trained to apply a scalar weight to the high-dimensional vector outputs (embeddings) of individual transformer layers to see if the combined mathematical representation better correlates with a set of human-annotated labels. It is a perfect example of transparent, technically precise writing.
Rhetorical Impact: The rhetorical impact of this mechanistic framing is highly stabilizing. It grounds the reader in the reality of the AI as a complex computational tool, completely devoid of agency, autonomy, or consciousness. This framing does not build relation-based trust; rather, it builds performance-based trust based on transparent methodologies. By exposing the mathematical reality of the system, it makes the idea that the AI 'knows' or 'understands' seem absurd. Decisions made based on this framing would be grounded in engineering metrics and statistical reliability, rather than misplaced faith in a synthetic mind.
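The scalar mixture quoted in this explanation can be sketched in a few lines. This is a minimal illustration under stated assumptions: the softmax normalization of the layer scores and the global scale `gamma` follow the common ELMo-style formulation and are not confirmed by the source, which says only that each layer receives a learned weight. The scores are fixed by hand here rather than learned, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Normalize raw scores into non-negative weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_mix(layer_vectors, layer_scores, gamma=1.0):
    """Weighted sum of per-layer vectors.

    layer_vectors: one vector per layer, all of the same length.
    layer_scores:  one raw scalar per layer (learned in a real probe).
    gamma:         global scale factor (assumed, ELMo-style).
    """
    weights = softmax(layer_scores)
    mixed = [0.0] * len(layer_vectors[0])
    for weight, vec in zip(weights, layer_vectors):
        for i, value in enumerate(vec):
            mixed[i] += weight * value
    return [gamma * value for value in mixed]

# Three toy "layers", each giving a 2-dimensional representation of one span.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores give equal weights, so the mixture is the plain average.
uniform = scalar_mix(layers, [0.0, 0.0, 0.0])

# A large score on the last layer makes the mixture approach that layer alone.
peaked = scalar_mix(layers, [0.0, 0.0, 10.0])
```

Comparing a probe trained on the mixed vector against probes trained on each single layer's vector is the contrast the quoted passage describes.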
Explanation 5
Quote: "For rhetorical parallelism, LLaMA-3.2-1B-Instruct achieves MAP of 0.93 when probed but only 0.18 when prompted... indicating that rhetorical structure is linearly decodable yet inaccessible through instruction-following."
Explanation Types:
- Empirical Generalization: Subsumes events under timeless statistical regularities
- Theoretical: Embeds in deductive framework, may invoke unobservable mechanisms
Analysis (Why vs. How Slippage): This explanation elegantly combines empirical generalization (reporting the stark difference in performance metrics) with a theoretical explanation (interpreting the meaning of those metrics). It frames the AI fundamentally mechanistically (how it behaves under different tests), but introduces a subtle agential limitation in the final clause. The use of 'linearly decodable' is a superb mechanistic description of the probing process. However, the phrase 'inaccessible through instruction-following' subtly shifts toward an agential framing. It implies the model tries to follow instructions but finds the information 'inaccessible', slightly obscuring the reality that the auto-regressive generation pipeline simply does not map those deep structural activations to the final token probabilities.
Consciousness Claims Analysis: The passage mostly avoids consciousness verbs, grounding its claims in 'achieves' and 'is linearly decodable'. The use of 'instruction-following' is a standard but slightly anthropomorphic industry term that implies comprehension of a command, but it is bounded by the surrounding technical context. The text accurately assesses the system as processing rather than knowing. The mechanistic reality is that the internal layer activations form distinct spatial clusters corresponding to rhetorical structures (decodable by a separate linear classifier), but the specific fine-tuning (RLHF/instruction tuning) applied to this Meta model causes its attention heads to prioritize other features when calculating the softmax probabilities for the next token.
Rhetorical Impact: This framing effectively highlights the complex, artifactual nature of the LLM. It shows that the system is not a unified 'mind', but a bundle of different mathematical capabilities that do not perfectly align. This reduces the perception of autonomy and forces the audience to view the model as a tool with specific, sometimes counterintuitive, structural constraints. By demonstrating that the model can 'encode' structure without being able to output it, the framing challenges the assumption that prompt-based interactions reveal the full extent of the system's capabilities, encouraging a more cautious, technically informed approach to AI evaluation and trust.
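The MAP figures quoted in Explanation 5 (0.93 probed, 0.18 prompted) are mean average precision scores. As a rough guide to what the metric measures, here is a minimal sketch using the standard definition. It assumes every relevant item appears somewhere in each ranked list; the source paper's exact evaluation protocol is not given here, and the toy rankings are invented for illustration.

```python
def average_precision(ranked_relevance):
    """AP for one query: mean of precision@k over the ranks k holding a relevant item."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    """MAP: the mean of the per-query average precision values."""
    return sum(average_precision(q) for q in queries) / len(queries)

# Toy rankings (1 = relevant, 0 = not relevant), listed best-ranked first.
good_rankings = [[1, 0, 0], [1, 1, 0]]   # relevant items ranked at the top
poor_rankings = [[0, 0, 1], [0, 1, 0]]   # relevant items ranked low

map_good = mean_average_precision(good_rankings)   # 1.0
map_poor = mean_average_precision(poor_rankings)   # (1/3 + 1/2) / 2 = 5/12
```

A gap like the one reported therefore means that the probe's similarity scores rank the correct items near the top far more consistently than the prompted outputs do.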
Task 4: AI Literacy in Practice - Reframing Anthropomorphic Language
About this task
This section proposes alternative language for key anthropomorphic phrases, offering more mechanistic and precise framings that better reflect the actual computational processes involved. Each reframing attempts to strip away the projections of intention, consciousness, or agency that are embedded in the original language.
V3 Enhancement: A fourth column addresses human agency restoration—reframing agentless constructions to name the humans responsible for design and deployment decisions.
| Original Anthropomorphic Frame | Mechanistic Reframing | Technical Reality Check | Human Agency Restoration |
|---|---|---|---|
| When Models Know More Than They Say | When the internal mathematical weights of a model contain linearly separable statistical patterns that its autoregressive generation pipeline fails to output as text. | Models do not possess justified belief (knowing) or intentional communication (saying). Mechanistically, researchers can train external classifiers to find high-dimensional spatial correlations in the model's hidden layers that the model's own next-token prediction function does not heavily weight during output generation. | N/A - describes computational processes without displacing responsibility. |
| they struggle in cases where an analogy is not apparent on the surface | The models fail to output statistically correlated token sequences when the testing benchmark lacks the structural text adjacencies present in their training data. | Algorithms do not experience subjective exertion or 'struggle'. Mechanistically, when a prompt lacks surface-level textual overlap with its training distribution, the attention mechanism cannot calculate high-probability pathways to generate the human-expected analogical output. | N/A - describes computational processes without displacing responsibility. |
| assessing whether LLMs acquire the competencies that support narrative understanding | Assessing whether human engineers have successfully designed training objectives that force LLMs to mathematically encode structural features of human narratives. | LLMs do not experience conscious awareness or 'understanding'. Mechanistically, the model classifies and processes token embeddings, continually adjusting internal weights during training to minimize prediction error across a vast corpus of narrative text. | Engineers at companies like Meta and OpenAI actively select the datasets and design the reinforcement learning pipelines that determine which statistical features these models encode. |
| do LLMs internalize typological structures... or are they simply leveraging surface-level correlations | Do transformer architectures encode highly distributed, multi-layer geometric representations of text structures, or do their outputs rely predominantly on localized N-gram and syntactical probabilities? | A matrix of parameters cannot 'internalize' knowledge into a cognitive framework. Mechanistically, the system dynamically calculates token probabilities. The question is whether its attention heads operate on deep, abstracted feature spaces across many layers or heavily weight immediate, adjacent token pairs. | N/A - describes computational processes without displacing responsibility. |
| reflects how open-source models fail to recruit encoded knowledge | Reflects how Meta's instruction-tuning pipeline creates an output generation function that does not heavily weight the deeper structural representations encoded in the base model's hidden layers. | The model possesses no executive function or conscious awareness to 'recruit' information. Mechanistically, the softmax layer that generates the final output token simply does not align with the hyperplanes identified by the researchers' external probes. | Meta's alignment researchers designed an instruction-following optimization protocol that mathematically suppresses or ignores the structural representations present in the pre-trained base model. |
| If models truly learn structured representations of text, they should exhibit efficiencies akin to human narrative understanding | If engineers successfully optimize models to map structural text features into distinct vector spaces, the resulting software should cluster narratives accurately on human-designed mathematical benchmarks. | Algorithms do not 'learn' or 'understand' in the biological or cognitive sense. Mechanistically, the gradient descent process updates numerical weights. To equate this mathematical curve-fitting with the conscious, empathetic, and contextual lived experience of human narrative understanding is a profound category error. | N/A - describes computational processes without displacing responsibility. |
| rhetorical parallelism is strongly encoded... yet largely inaccessible via prompting in open-source models | Rhetorical structure creates linearly separable clusters in the hidden layer vector space, but Meta's LLaMA generation pipeline does not utilize these clusters when predicting next tokens from a prompt. | A model does not possess a subconscious mind containing 'inaccessible' knowledge. Mechanistically, 'probing' trains a separate classifier built by researchers to find mathematical patterns, while 'prompting' relies on the model's own heavily filtered autoregressive output function. The two simply rely on different statistical mechanisms. | Meta's engineers deployed a specific reinforcement learning procedure (RLHF) that mathematically filters the output generation, overriding the structural text correlations present in the model's base training. |
| Analogical reasoning is a core cognitive faculty... While LLMs perform well when surface and structural cues align, they struggle... | Analogical reasoning requires human consciousness. While LLMs output high-probability text when prompts mirror frequent training data patterns, their statistical token prediction fails to match human benchmarks on out-of-distribution abstract structures. | Computational pattern-matching is not 'reasoning'. The model does not use logic, judgment, or conceptual mapping. Mechanistically, it generates sequences based on conditional probabilities; if the specific abstract structure was rare in the training data, the generated output will not reflect the requested analogy. | The designers of the LLMs failed to curate training datasets that contained sufficient density of abstract structural correlations to allow the statistical engine to generalize to these specific analogical benchmarks. |
Task 5: Critical Observations - Structural Patterns
Agency Slippage
The text exhibits a systematic and highly functional oscillation between mechanical precision and agential anthropomorphism. This slippage primarily moves in a mechanical-to-agential direction over the course of the paper, serving to launder metaphor through mathematics. The text begins with bold, agential claims in the title and introduction (e.g., Models 'know' more than they 'say'; they 'acquire understanding'; they perform 'reasoning-like operations'). As it transitions into the methodology and results sections, the agency abruptly shifts to human researchers and mathematical operations: 'The scalar mixture assigns a learned weight', 'span representations are obtained via mean pooling', 'we compute cosine similarity'. Here, the system is strictly a passive artifact. However, in the discussion and conclusion, the text slips dramatically back into agential framing, using the mechanical findings to justify massive consciousness projections: the models 'fail to recruit encoded knowledge', demonstrating a 'dissociation between what models know... and what they can do.'
This pattern reflects a profound 'curse of knowledge.' The researchers, possessing immense technical expertise and understanding the deep mechanisms of the transformer architecture, project their own cognitive coherence onto the model. Because they can find a mathematical pattern using a probe, they assume the model 'knows' that pattern, transforming a structural property of a matrix into an epistemic state of an agent.
Furthermore, this slippage relies heavily on agentless passive constructions to erase the human engineers who built the models. The text says 'rhetorical parallelism is strongly encoded' and 'information is both weakly encoded and weakly accessible,' obscuring the fact that engineers at Meta, OpenAI, and Anthropic actively made the decisions about training data distributions, architectural parameters, and RLHF objectives that caused these specific encodings. By oscillating between viewing the AI as a mechanistic 'how' (Functional/Theoretical explanations) and an agential 'why' (Intentional/Dispositional explanations), the text accomplishes a vital rhetorical task: it makes it sayable that an algorithm possesses a subconscious mind, insulating the corporate creators from the limitations of their product by blaming the machine's failure to 'recruit' its own intelligence.
Metaphor-Driven Trust Inflation
The paper constructs and leverages trust through a complex architecture of consciousness metaphors, specifically by invoking an 'epistemic dualism' that treats the AI as possessing both a conscious and a subconscious mind. By framing the central finding as a gap between what the model 'knows' (internal representations) and what it 'says' (prompted output), the text inadvertently cultivates a highly dangerous form of relation-based trust. It suggests that the AI is not merely a statistical tool, but a deeply sophisticated entity that holds a vast, hidden reservoir of justified true beliefs.
This consciousness framing signals a profound, almost mystical competence. When the authors claim the model 'struggles' or 'fails to recruit encoded knowledge,' they apply frameworks of human sincerity and executive dysfunction to a machine. If a human struggles to recall a fact, we still trust their underlying intelligence; applying this same logic to an LLM suggests that the model's failures are mere superficial glitches rather than fundamental absences of capability. This transfers our human-trust frameworks inappropriately onto statistical systems.
Furthermore, this anthropomorphism completely reframes system limitations. When the open-source LLaMA model fails the prompted analogical reasoning task, the text does not conclude that the model is a defective product fundamentally incapable of abstract logic. Instead, framed agentially, the failure becomes an intriguing psychological mystery: the model 'knows' the answer but 'fails to recruit' it. This narrative protects the perceived authority and reliability of the AI. It encourages users and policymakers to extend deep relation-based trust to systems that are fundamentally incapable of reciprocating or possessing ethical vulnerability. The risk emerges when audiences, convinced that the AI 'knows' the right answer deep down, deploy these systems in high-stakes environments (law, medicine), assuming that better prompt engineering will eventually coax the hidden, infallible truth out of the machine's subconscious.
Obscured Mechanics
The anthropomorphic language and consciousness framings in this text conceal vital technical, material, and corporate realities. Applying the 'name the corporation' test reveals the depth of this obscuration. When the text discusses 'how open-source models fail to recruit encoded knowledge' or how 'closed-source models achieve probing-level performance,' it treats these artifacts as autonomous natural phenomena. It entirely obscures the specific engineering teams at Meta (LLaMA), OpenAI (GPT), and Anthropic (Claude) who actively designed the architectures, scraped the specific datasets, and implemented the alignment protocols (like RLHF) that dictate these behaviors.
The claim that an AI 'knows/understands' something hides profound technical dependencies. It conceals the model's absolute reliance on the statistical distribution of its training data: a model does not 'know' an analogy; it reproduces token sequences that are probable given the human writing it was trained on. It hides the absence of any grounding in external reality or causal models. Furthermore, it obscures the proprietary opacity of the systems. The text makes confident assertions about the capabilities of GPT-5.2 and Claude Opus without acknowledging that their training data, parameter counts, and alignment mechanisms are undisclosed trade secrets.
Labor and commercial realities are also erased. The 'alignment' that causes the model to 'fail to say' what it 'knows' is heavily dependent on the underpaid labor of global data annotators who ranked outputs to train the reward models. The metaphorical framing conceals this human labor, attributing the resulting behavior to the machine's own cognitive processes. The primary beneficiaries of this concealment are the AI corporations. By framing LLMs as autonomous, thinking entities with 'latent knowledge,' the discourse protects their astronomical valuations and deflects scrutiny away from their opaque data practices and the inherent brittleness of their products. If the metaphors were replaced with mechanistic language, the discourse would immediately reveal that these are flawed corporate software products highly dependent on stolen human data, not nascent digital minds.
Context Sensitivity
The distribution and intensity of anthropomorphic language across the text is highly strategic, following a distinct U-shaped curve. The introduction and conclusion are densely packed with intense consciousness claims ('know', 'say', 'understand', 'reasoning', 'cognitive faculty'). The methodological middle sections, however, are virtually devoid of anthropomorphism, relying strictly on precise mechanistic language ('cosine similarity', 'mean pooling', 'linear classifiers', 'scalar mixtures').
This structural asymmetry serves a vital rhetorical function: the text establishes scientific credibility and rigor through the mechanical language of the methodology, and then leverages that exact credibility to license the aggressive anthropomorphism of the conclusion. The math provides an aura of objective truth, which is then cashed out to validate the illusion of mind.
Crucially, there is a profound asymmetry in how capabilities versus limitations are framed. Capabilities are consistently framed in agential, conscious terms: the models can 'exhibit analogical reasoning,' they 'internalize typological structures,' and they 'know' patterns. However, limitations are often framed either in mechanical terms ('limitations in how prompting accesses available information') or as agential failures of a fundamentally brilliant mind ('struggle,' 'fail to recruit'). This asymmetry accomplishes a vital marketing and vision-setting goal: it ensures that successes are attributed to the model's growing intelligence, while failures are written off as minor technical glitches in the interface (prompting) or temporary psychological blocks. The text shifts seamlessly from acknowledged analogies ('akin to human narrative understanding') to literalized claims ('When Models Know More Than They Say'). This pattern reveals an implied audience of both technical peers (who require the math) and broader stakeholders or funders (who are captivated by the narrative of artificial minds), demonstrating how technical discourse actively constructs the mythos of autonomous AI.
Accountability Synthesis
This section synthesizes the accountability analyses from Task 1, mapping the text's "accountability architecture"—who is named, who is hidden, and who benefits from obscured agency.
Synthesizing the accountability analyses reveals a systemic architecture of displaced responsibility. The text systematically diffuses human accountability by treating the AI models as the primary loci of agency and decision-making. Throughout the paper, actors are overwhelmingly 'Hidden' or only 'Partially' named. The 'Models' are the grammatical subjects that 'perform', 'struggle', 'learn', 'encode', and 'fail to recruit.' The human researchers who design the probes are visible, but the corporate engineers at Meta, OpenAI, and Anthropic who actively built the fundamental flaws into the systems are entirely erased.
This linguistic pattern creates a massive 'accountability sink': the model's internal representations. When a system fails to output a correct analogy despite having the structural pattern encoded in its weights, the responsibility does not fall on the corporate developers for creating a brittle, misaligned product. Instead, the responsibility disappears into the abstraction of the machine's 'subconscious'; the model simply 'failed to recruit' its own knowledge. This transfers agency to the artifact, absolving the creators.
The liability implications of this framing are profound. If policymakers and the legal system accept the framing that an AI 'knows' things but autonomously 'fails to say' them, it becomes nearly impossible to hold corporations strictly liable for algorithmic harms. The failure is viewed as an unpredictable cognitive glitch of an autonomous agent rather than a predictable defect of a statistical product.
Naming the actors would fundamentally alter this landscape. If we replace 'open-source models fail to recruit encoded knowledge' with 'Meta's instruction-tuning algorithms actively suppress the structural correlations present in their base training data,' entirely new questions become askable. We can ask why Meta chose that specific tuning, what datasets they used, and how they can be regulated to fix the defect. The current text benefits the AI industry by obscuring their design choices and profit motives behind the fascinating, distracting illusion of a struggling digital mind.
Conclusion: What This Analysis Reveals
The discourse within the analyzed text is driven by a profound and interlocking system of anthropomorphism, dominated primarily by the 'Epistemic Dualism' pattern and supported by the 'AI as Cognitive Agent' pattern. The foundational assumption—the load-bearing pillar that must be accepted for the entire paper's narrative to function—is that computational artifacts possess an internal, subjective locus of cognition that is distinct from their material outputs. This Epistemic Dualism maps the mathematical separation of vector embeddings onto the human psychological concept of an inaccessible 'subconscious' mind ('knowing'), while mapping the autoregressive generation of tokens onto conscious, deliberate articulation ('saying'). This architecture is highly sophisticated; it does not merely map simple verbs, but constructs a complex analogical structure of a divided digital psyche. This consciousness projection is essential because it allows the authors to reframe a mechanical failure (the inability of the generation pipeline to utilize structural weights) as a psychological mystery (the model has the knowledge but fails to 'recruit' it). If you remove this foundational illusion of a 'knowing' subject, the entire narrative collapses, revealing a straightforward engineering report about mismatched statistical optimization pipelines.
Mechanism of the Illusion:
This text creates the 'illusion of mind' through a masterful rhetorical sleight-of-hand: the literalization of a psychological metaphor via mathematical measurement. The central trick relies on the 'curse of knowledge'. The authors, understanding the highly abstracted, structural concepts of rhetorical parallelism, build a linear classifier (a probe) that successfully separates the model's hidden layers along those conceptual lines. Because the researchers found the pattern, they project that awareness back onto the machine, claiming the model 'knows' the pattern. They then establish a temporal and logical chain: first proving the mathematical presence of the vectors (mechanistic credibility), and then applying consciousness verbs ('know', 'understand', 'recruit') to explain why those vectors don't emerge in the final text. This exploits the deep vulnerability of an audience primed by science fiction and industry hype to desire the discovery of artificial sentience. By framing the model's failure to output the correct analogy as a 'struggle' to 'access' its own 'encoded knowledge', the text shifts subtly from a rigorous analysis of parameter weights into a psychoanalysis of a digital subject, seducing the reader into accepting the existence of a ghost within the machine.
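To make the probing step concrete, here is a minimal sketch of a linear probe. It assumes invented two-dimensional "hidden vectors" and swaps in a simple perceptron for whatever classifier the paper actually trained; the names `train_probe` and `probe_accuracy` are hypothetical. The point is only that probe accuracy reports linear separability of vectors under labels the researchers supplied.

```python
def train_probe(vectors, labels, epochs=100):
    """Fit a perceptron: find w, b so that sign(w.x + b) matches each label (+1 or -1)."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(vectors, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified (or on the boundary): nudge the separator
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:  # every point is on the correct side
            break
    return weights, bias

def probe_accuracy(weights, bias, vectors, labels):
    """Fraction of vectors the linear separator assigns to the researcher-given label."""
    correct = 0
    for x, y in zip(vectors, labels):
        score = sum(w * xi for w, xi in zip(weights, x)) + bias
        predicted = 1 if score > 0 else -1
        correct += predicted == y
    return correct / len(vectors)

# Hypothetical stand-ins for hidden states; the labels come from the researchers.
hidden_vectors = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.3],  # labeled -1 (e.g. "no parallelism")
    [1.0, 0.9], [0.8, 1.1], [1.2, 1.0],  # labeled +1 (e.g. "parallelism")
]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_probe(hidden_vectors, labels)
print(probe_accuracy(w, b, hidden_vectors, labels))
```

The category boundary lives in `labels`, which the researchers supply, and in `w` and `b`, which the training loop fits. High accuracy shows the vectors are separable along that boundary; it does not by itself establish an epistemic state in the system that produced them.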
Material Stakes:
Categories: Epistemic, Regulatory/Legal, Institutional
The metaphorical framings employed in this text carry severe material consequences across multiple domains. Epistemically, when peer-reviewed literature validates the concept that models 'know' things they do not 'say', it warps the trajectory of AI research. Funding and institutional energy are diverted away from understanding the material limitations of training data distributions and toward a quasi-mystical pursuit of unlocking the 'hidden minds' of algorithms. This epistemic distortion directly feeds Institutional behavior: tech companies use this framing to hype their products, claiming their models possess vastly superior 'latent capabilities' that simply need better prompt engineering, thereby driving up valuations based on phantom intelligence.
Most critically, the Regulatory and Legal stakes are immense. If legislators and judges adopt the framing that an LLM 'understands' concepts but occasionally 'struggles' to articulate them, it creates a massive accountability sink. It shifts the legal paradigm from product liability (where a corporation is responsible for a defective statistical tool) to a framework resembling human negligence (where the AI made a 'mistake' despite 'knowing' better). This ambiguity heavily benefits the tech conglomerates—who escape strict regulation and liability for algorithmic harms—while the cost is borne by the public, who are subjected to toxic, biased, or hallucinated outputs generated by unaccountable, unregulated statistical engines masked as struggling minds.
AI Literacy as Counter-Practice:
Critical literacy and mechanistic precision serve as a vital counter-practice to the material risks of anthropomorphism. Reframing the paper's title from 'When Models Know More Than They Say' to 'When Model Weights Encode Statistical Parallels That Output Generators Do Not Extract' fundamentally strips away the illusion of the digital subconscious. By replacing the consciousness verbs ('knows', 'understands') with precise mechanistic actions ('processes', 'correlates', 'clusters vectors'), we force the recognition that the system has no awareness, no deep reservoir of truth, and relies entirely on human data.
Furthermore, restoring human agency—changing 'models fail to recruit' to 'Meta's instruction-tuning pipeline suppresses deeper structural correlations'—forces accountability back onto the corporate creators. Systematic adoption of this literacy would require academic journals and conferences (like ACL) to mandate strict style guides that ban unhedged epistemic verbs when describing algorithmic processes. Researchers would need to commit to distinguishing between their mathematical measurements and the ontological status of the machine. Naturally, this precision faces immense resistance from the AI industry and its institutional beneficiaries. Anthropomorphic language serves their financial interests by hyping capabilities, mystifying failures, and diffusing liability. Mechanistic literacy threatens this power by exposing LLMs as deeply flawed, highly dependent corporate artifacts rather than autonomous, magical intellects.
Path Forward
Looking at the broader discursive ecology, the choice of vocabulary profoundly shapes what is possible in AI governance and understanding. The current status quo, which freely mixes mechanistic math with aggressive anthropomorphism ('models know and reason'), maximizes narrative resonance and industry hype but renders critical regulation nearly intractable, as it obscures the material reality of the technology. Adopting a strictly mechanistic vocabulary ('the model retrieves and ranks tokens based on probability distributions') grounds the technology in reality and clarifies liability, making it obvious that developers are responsible for the weights they tune. However, this precision comes at the cost of public accessibility, as dense mathematical descriptions alienate non-experts.
We can map several potential discourse futures. If the anthropomorphic trajectory deepens, we risk a future where AI systems are granted quasi-legal personhood or moral agency, embedding the assumption that machines are collaborative partners rather than tools, severely eroding human accountability and leaving society vulnerable to unregulated algorithmic decision-making. Conversely, if institutional structural changes are made—such as funding bodies requiring mechanistic translations of capability claims, or regulatory frameworks demanding full disclosure of training data rather than accepting claims of 'latent knowledge'—a future of mechanistic precision could emerge. This would solve the liability ambiguity and demystify the technology, though it would require immense effort to educate the public on statistical probability. Ultimately, different stakeholders have radically different incentives: tech corporations benefit from the awe and legal cover provided by the 'AI knows' discourse, while civil rights advocates, regulators, and the public fundamentally require the 'model processes' discourse to demand transparency, assert control, and enforce accountability.
Extended Processing Summary
Gemini's Processing Steps
Gemini refers to the text below as "thought summaries." This is an overt consciousness projection because 'thoughts' are hallmarks of a conscious mind that 'knows' what it is doing and why. The concealed mechanistic process is probabilistic text generation. Treat this as just another rhetorical artifact, a way of making the model's processing legible. When the model writes "I should double check that...", those tokens become part of the context on which subsequent next-token probabilities are conditioned, which can steer later output toward a correction. The "I" is a functional gear in the machine, not a self-reference. This language is a deliberate interface design choice, not a biological reality. The developers could have programmed the model to output passive logs. Instead, they chose to simulate a narrator.
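The mechanistic reading of a phrase like "I should double check that..." can be illustrated with a deliberately tiny stand-in: a bigram counter rather than a transformer. The corpus and function names below are invented for illustration. The sketch shows only that each emitted token becomes part of the context and thereby changes the distribution over what follows.

```python
from collections import Counter, defaultdict

def fit_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_distribution(counts, context):
    """Probability of each next token, conditioned on the last token of the context."""
    followers = counts[context[-1]]
    total = sum(followers.values())
    return {token: count / total for token, count in followers.items()}

# Hypothetical training text.
corpus = "i should check the answer . the answer is correct . i should check the math .".split()
counts = fit_bigram(corpus)

context = ["i", "should", "check"]
before = next_token_distribution(counts, context)
context.append("the")  # the emitted token is appended to the context
after = next_token_distribution(counts, context)
print(before, after)
```

A real model conditions on the whole context through learned weights rather than on the last token through counts, but the structural point carries over: the narrated "I" is text that feeds back into the next prediction.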
Analyzing the Request
I'm now diving into this critical discourse analysis of AI metaphor. I'm focusing on identifying instances of anthropomorphism and analogical reasoning within the provided text. My goal is to pinpoint how language shapes our understanding of AI capabilities. I'm looking at potential biases.
Defining the Analytical Framework
I've established the roles I'm embodying for this analysis: a critical discourse analyst leveraging Lakoff & Johnson and CDA principles, influenced by Robert Brown. The focus is now on consciousness projections, displaced human agency, and the tension between mechanistic precision and agential language. I've designed the output structure with five structured JSON tasks to facilitate the audit.
Examining the Article's Core
I'm now closely examining the Lippincott and Sirin paper. Its core argument contrasts what models "know" with what they output, especially in analogical reasoning. The paper's pervasive use of cognitive metaphors like "reasoning," "know," and "understand" is striking. "Agency displacement" through phrases like "models learn" and "LLMs internalize" is also worth noting.
I've already begun drafting the first audit task, focusing on metaphors and anthropomorphism. My initial instance, the paper's title "When Models Know More Than They Say," clearly frames the model as a conscious agent, attributing justified belief and intentional communication. A second instance, the abstract's language regarding models "struggling" with analogies, casts them as effortful problem-solvers.
Identifying Cognitive Projections
I'm now cataloging additional instances from the paper, like the model's "struggle" with analogies, which implies subjective exertion. I've also identified instances where the model "acquires competencies" and "internalizes structures," projecting learning and human-like understanding. Furthermore, I've noted the framing of the model as a strategic learner and the explicit acknowledgment of the tension between learning and surface-level correlations. The dualistic framing of "what models know" versus "what they can do" is a crucial epistemic point.
Cataloging Agency Metaphors
I'm presently focused on detailing further instances of agency projection within the source material. I've noted how the phrase "open-source models fail to recruit encoded knowledge" frames models as agents managing internal resources, projecting executive function. The concept of "assessing how knowledge is represented and accessed" similarly paints the model as an epistemic repository. I am also assessing schema constraints for task one. Specifically, I'm examining the need for acknowledgment and accountabilityAnalysis.
Mapping Analogical Transfers
I'm now identifying additional analogical transfers, like the student struggling with a test, which I'm mapping to a model statistically failing to predict tokens. The reader comprehending a story is analogous to a model encoding statistical correlations. Human memory is paralleled to gradient descent and weight updates. I've already begun drafting the third task: explanation audit, using Brown's typology.
Refining Instance Analysis
I'm now detailing additional instances of analogical reasoning, focusing on dispositional/empirical generalizations in structured prompting, and functional/theoretical explanations in model architecture descriptions. I'm also analyzing rhetorical parallelism within the model's MAP, exploring theoretical and empirical generalizations. My focus is sharpened on these nuanced classifications for task three. I've successfully drafted the reframing for task four, and am beginning to enumerate the reframed instances.
Revising Critical Task
I'm now revising my analysis, specifically focusing on critical task five. I'm noting the linguistic oscillation between seeing LLMs as statistical objects and epistemic subjects, particularly in the introduction and conclusions. I'm focusing on "engineering failure" versus "psychological" attribution and the implications for trust. I've begun to enumerate my points in detail for Task 5.
Synthesizing Epistemic Dualism
I'm now integrating my insights to articulate the Epistemic Dualism pattern. I'm focusing on how the paper oscillates between viewing LLMs as statistical entities and epistemic subjects, especially in obscured mechanics, context sensitivity, and accountability synthesis, for task five. I'm noting how "engineering failure" is downplayed, while "psychological" attributions are emphasized.
Articulating Epistemic Tensions
I'm now articulating the core epistemic tension within the paper, specifically the problematic framing of computational artifacts as having distinct internal cognition. I'm focusing on the illusory mechanism of translating probe performance into "unconscious knowing" and prompt performance into "conscious saying." I'm also examining the material stakes, including research funding and public perception, and advocating for dual terminology.
Validating Compliance & Scope
I'm now meticulously reviewing my audit structure against the specified constraints, especially regarding JSON syntax compliance. I'm focusing on ensuring all text-based fields adhere to the JSON format, specifically avoiding control tokens, and using proper escaping for special characters. I'm also critically checking word counts for each task and ensuring that the contrastive justifications within Task 1 are robust, demonstrating the clear rationale behind each categorization. I'm making sure my categorization is not entirely uniform in Task 1.
Refining Visibility Metrics
I'm now refining the visibility metrics for each instance, using "Partial," "Hidden," and "Named" to capture agency. I'm focusing on "Agency Visibility" to classify references to AI agents, their outputs, and their behavior. For each instance, I'm noting specific examples, such as "models fail," which I will mark as "Hidden," and "OpenAI/GPT/Claude," which I will mark as "Named." My goal is to ensure each description is accurate and that I meet word count targets for the JSON fields.
Finalizing Data Formatting
I'm now putting the finishing touches on my JSON formatting, making absolutely sure that every string complies with the "no control tokens" rule. I'm focusing on ensuring each task's long textual descriptions, especially within task one, are single, unbroken paragraphs. I'm very carefully reviewing and validating all fields to ensure data integrity and full JSON compliance for task one. I'm also ensuring that instance counts and word counts are met, and that the structure is completely valid.
Clarifying Acknowledgment Categories
I'm currently refining the acknowledgment categories for task one. The initial categorization of the title, "When Models Know More Than They Say," as "Direct (Unacknowledged)" felt slightly incongruous. I'm now exploring the nuances of "explicit acknowledgement" where rhetorical shorthand is used but there are no scare quotes. The text then proceeds to build its argument on the initial dichotomy without further qualification.
Ensuring Format Compliance
I'm now laser-focused on the stringent format requirements for the JSON output. Specifically, I'm ensuring no control tokens like \n are present, not even for formatting. I'm building a single, minified JSON string to guarantee full compliance. I'm making sure it is all on one line.
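The formatting constraint described here (one line, no control characters, quotes escaped) can be sketched with Python's standard `json` module. The record below and its field names are hypothetical and do not reproduce the actual audit schema; collapsing whitespace inside string values is one assumed way to satisfy a "no control tokens" rule.

```python
import json

def strip_control_chars(value):
    """Recursively collapse newlines, tabs, and repeated spaces inside string values."""
    if isinstance(value, str):
        return " ".join(value.split())
    if isinstance(value, list):
        return [strip_control_chars(item) for item in value]
    if isinstance(value, dict):
        return {key: strip_control_chars(item) for key, item in value.items()}
    return value

def minify(record):
    """Serialize to a single line with no padding; json.dumps escapes inner quotes."""
    return json.dumps(strip_control_chars(record), separators=(",", ":"), ensure_ascii=False)

# Hypothetical audit record.
record = {
    "title": 'When Models "Know" More Than They Say',
    "analysis": "line one\nline two\tend",
}
minified = minify(record)
print(minified)
```

Parsing `minified` back with `json.loads` recovers the same fields, with the embedded quotes intact and the newline and tab replaced by single spaces.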
Validating Instance Classification
I'm now deeply immersed in Task 1, refining the instance classifications and making sure I've met all word count targets and enum matches. I've re-evaluated the "Hidden" instances based on the refined visibility metrics, specifically ensuring that when a model's agency is obscured, that these are classified accordingly. I am also checking against enum matches for all of Task 1.
Categorizing Agent Status
I'm now implementing visibility classifications for agency attribution. My categories are "Hidden," "Named," and "Partial," with supporting examples for each. For instance, "Hidden" applies to passive constructions, and "Named" to specific model references. I'm carefully constructing the text for Tasks 5 and 6, and am confirming that I meet the word limit.
Tracing the Argument's Progression
I'm currently tracing the argument's progression throughout the text. I'm focusing on the oscillation between technical framing and anthropomorphic claims. Specifically, I'm examining how mechanical descriptions in the methodology lend credibility to agential claims made in the introduction and conclusion. I'm noting how this framing enables the authors to attribute "knowing" to the model.
Dissecting Agential Framing
I'm currently focusing on how the text's naturalization of LLMs, likening them to humans reliant on narratives, obscures deliberate engineering. I'm noting how "models fail to recruit knowledge" hides the labor of data teams. My strategy now is to form the JSON fully, escaping quotes and ensuring the output is one single, long line.
Discourse Depot © 2025 by TD is licensed under CC BY-NC-SA 4.0