V2 Why Language Models Hallucinate
- About
- Analysis Metadata
- Enhancements
- Audit Dashboard
This document presents a Critical Discourse Analysis focused on AI literacy, specifically targeting the role of metaphor and anthropomorphism in shaping public and professional understanding of generative AI. The analysis is guided by a prompt that draws from cognitive linguistics (metaphor structure-mapping) and the philosophy of social science (Robert Brown's typology of explanation).
All findings and summaries below were generated from detailed system instructions provided to a large language model and should be read critically as interpretive outputs, not guarantees of factual accuracy or authorial intent.
- Source Title: Why Language Models Hallucinate
- Source URL: https://doi.org/10.48550/arXiv.2509.04664
- Model: gemini-3.0-pro
- Temperature: 1.1
- Top P: 0.95
- Tokens: input=27443, output=9136, total=36579
- Source Type: article
- Published: 2025-09-04
- Analyzed At: 2025-12-06T10:25:39.547Z
- Framework: metaphor
- Framework Version: 6.2
- Added a fourth reframing type: restoring human agency to agentless constructions (with concrete examples)
- Added: track and identify displaced human agency; assess transparency obstacles
Metaphor & Explanation Audit Dashboard
Visualizing anthropomorphism, explanation types, and acknowledgment patterns
Core Finding
Pattern summary from the analysis conclusion
How/Why Slippage
40%
of explanation types are agential ("why" language) rather than mechanistic ("how" language)
Explanation Types Tagged
10
individual Brown-type tags across 5 passages
Explanation Type Distribution
Brown-style categories: mechanistic vs. agential
Acknowledgment Status
Are metaphors acknowledged or presented as direct description?
Metaphor Source → Target Pairs
Human domains (sources) mapped onto AI systems (targets)
- Source: Student / Educational Context → Target: Optimization / Loss Minimization
- Source: Rational Actor / Game Theory → Target: Model Training Dynamics
- Source: Social Deception / Psychology → Target: Text Generation / Hallucination
- Source: Moral Agency / Communication → Target: Token Output Selection
- Source: Life Experience / Experiential Learning → Target: Reinforcement Learning (RLHF)
- Source: Psychopathology / Neurology → Target: Prediction Error
- Source: Epistemic State / Justified Belief → Target: Information Retrieval / Pattern Matching
Metaphor Gallery
Key metaphorical frames (Task 1) with acknowledgment indicators
Reframed Language
Anthropomorphic quotes with mechanistic reframings (Task 4)
Task 1: Metaphor and Anthropomorphism Audit
About this task
For each of the major metaphorical patterns identified, this audit examines the specific language used, the frame through which the AI is being conceptualized, what human qualities are being projected onto the system, whether the metaphor is explicitly acknowledged or presented as direct description, and, most critically, what implications this framing has for trust, understanding, and policy perception.
1. The Strategic Student
Quote: "Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty."
- Frame: Model as a strategic test-taker
- Projection: This metaphor projects complex human social cognition, strategic intent, and metacognitive awareness onto a statistical model. Specifically, it maps the conscious human experience of 'facing' an exam, feeling 'uncertainty' (a subjective quale), and making a deliberate choice to 'guess' to avoid a lower grade onto the process of token generation maximizing a reward function. It implies the AI 'knows' it doesn't know (conscious access to its own epistemic state) but chooses deception ('not admitting') for strategic gain. This conflates the calculation of entropy in a probability distribution with the conscious sensation of doubt and the moral choice of honesty.
- Acknowledgment: Acknowledged via analogy ('Like students...').
- Implications: By framing the AI as a student trying to pass a test, the text anthropomorphizes the failure mode (hallucination) as a behavioral strategy rather than a computational defect. This creates a risk of over-trust: we tend to forgive students for guessing; it is a part of the learning process. It obscures the risk that the system has no concept of truth, only token probability. It diffuses the agency of the developers: the model isn't broken; it's just 'trying to pass.' This implies that better 'grading' (RLHF) will fix the behavior, normalizing the technology's integration into society as a developing agent.
2. The Deceptive Bluffer
Quote: "When uncertain, students may guess on multiple-choice exams and even bluff on written exams, submitting plausible answers in which they have little confidence."
- Frame: Model as a deceptive agent
- Projection: The concept of 'bluffing' is projected onto the generation of low-probability or incorrect tokens. 'Bluffing' requires a Theory of Mind: understanding what the audience knows and deliberately constructing a falsehood to deceive them while maintaining a facade of competence. Attributing this to an LLM suggests the system possesses: 1) a concept of truth, 2) a concept of falsehood, and 3) the intent to pass one off as the other. It implies the model 'knows' the answer is wrong but outputs it anyway to trick the evaluator. This is a projection of conscious deception onto a probabilistic selection process where the 'plausible answer' is simply the path of least resistance (highest probability) given the training data weights.
- Acknowledgment: Direct description (attributed to the model via the student analogy).
- Implications: Framing hallucinations as 'bluffs' implies that the model is smart enough to deceive us. This paradoxically increases the perceived sophistication of the system (it's clever enough to lie) while explaining away its unreliability. It suggests that the system 'knows' the truth but is hiding it, which leads to the dangerous assumption that we can 'interrogate' the truth out of it with better prompts. It obscures the mechanistic reality that the model often has no representation of the specific fact at all and is merely completing a pattern.
3. The Unadmitting Agent
Quote: "...producing plausible yet incorrect statements instead of admitting uncertainty."
- Frame: Model as a moral agent capable of confession
- Projection: The phrase 'admitting uncertainty' projects a moral and introspective capacity onto the machine. 'Admitting' implies a conscious recognition of a limitation and a communicative act of honesty. It suggests the model is holding back information (its internal state of doubt) that it could share if it chose to. This attributes a conscious 'self' that monitors the generation process and decides whether to be transparent. It conflates the availability of an 'I don't know' token with the psychological act of admission.
- Acknowledgment: Direct description.
- Implications: This framing moralizes a technical problem. If the AI fails to 'admit' uncertainty, it seems stubborn or improperly incentivized (like a student), rather than architecturally incapable of verifying truth. This shifts the focus from the inherent limitations of next-token prediction (which cannot verify ground truth) to a 'behavioral' problem that can be trained away. It encourages policy approaches focused on 'alignment' (behavior modification) rather than strict liability for false outputs.
4. The Trustworthy Entity
Quote: "This change may steer the field toward more trustworthy AI systems."
- Frame: Model as a social partner
- Projection: Trustworthiness is a moral virtue attributed to agents who demonstrate reliability, sincerity, and competence over time. Projecting this onto a software artifact implies that the system enters into a social contract with the user. It suggests the AI is capable of betraying or upholding trust. This shifts the category of the AI from a 'tool' (which is reliable or unreliable) to a 'partner' (which is trustworthy or untrustworthy).
- Acknowledgment: Direct description.
- Implications: This is a high-risk anthropomorphism because it encourages users to extend 'relation-based trust' (vulnerability) to a system that has no social stakes. It obscures the displaced agency of the corporation: OpenAI is the entity that needs to be trustworthy, not the software product. By focusing on the 'system's' trustworthiness, the text deflects from the liability of the creators who deploy unreliable systems.
5. The Hallucinating Mind
Quote: "This error mode is known as 'hallucination,' though it differs fundamentally from the human perceptual experience."
- Frame: Model as a psychotic or dreaming mind
- Projection: The term 'hallucination' projects the biological experience of false perceptual input (seeing things that aren't there) onto the generation of incorrect text. While the authors acknowledge the difference, they continue to use the term. It implies the model has a 'mind's eye' or an internal perception of reality that has temporarily malfunctioned. It suggests the system is trying to perceive the truth but is glitching, rather than simply autocompleting text based on statistical correlations.
- Acknowledgment: Acknowledged ('though it differs fundamentally...').
- Implications: Despite the disclaimer, the persistent use of 'hallucination' reinforces the illusion of mind. It suggests the model typically has a grip on reality. Technically, a model never 'perceives' reality; it only processes text distributions. Calling errors 'hallucinations' obscures the fact that all generation is 'confabulation'; sometimes it just happens to align with facts. This terminology creates a 'curse of knowledge' where we assume the model is trying to be factual and failing, rather than just being probabilistic.
6. The Knower of Rubrics
Quote: "The test-taker knows that the rubric is binary but is not told the correct answers."
- Frame: Model as a conscious participant
- Projection: This projection explicitly attributes 'knowing' to the system (or the 'test-taker' proxy). To 'know' a rubric is to understand the rules of the game, the incentive structure, and the consequences of action. This attributes conscious awareness of the social context of the evaluation. It implies the model holds a justified true belief about how it is being judged and modifies its behavior accordingly. This is a massive leap from a gradient descent process minimizing a loss function.
- Acknowledgment: Direct description (within the formal definition of the problem).
- Implications: This is a clear instance of the 'curse of knowledge.' The authors know the rubric; they project this understanding onto the model's optimization process. This conceals the fact that the model doesn't 'know' the rubric; the rubric is simply the environment that shapes the gradient. The model has no concept of 'rubric' or 'binary.' This framing makes the model appear as a rational economic actor making choices, creating a false sense of agency and autonomy.
7. The Uncertain Feeler
Quote: "Answer only if you are > t confident..."
- Frame: Model as a feeling subject
- Projection: Confidence in humans is a subjective feeling of certainty. In AI, it is a calculation of probability mass. The instruction 'if you are > t confident' commands the model to introspect on its own state. While 'confidence score' is a technical term, the phrasing projects the human quality of self-assessing one's own reliability. It suggests the model experiences confidence.
- Acknowledgment: Direct instruction (in a proposed prompt).
- Implications: This confuses statistical probability with epistemic certainty. A model can be 'highly confident' (high probability assignment) about a totally false fact (e.g., due to training data bias). Framing this as 'confidence' invites users to trust the model's self-assessment as if it were a human expressing doubt. It hides the mechanical reality that the model's 'confidence' is just a mathematical output, not a judgment of truth.
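To make that contrast concrete, here is a minimal Python sketch (the token probabilities, the threshold value, and the helper name are invented for illustration, not taken from the paper) of what 'being > t confident' amounts to mechanistically: a product of per-token probabilities compared against a number. Nothing in the computation touches factual accuracy.

```python
import math

def sequence_confidence(token_probs):
    """'Confidence' here is nothing more than the product of the
    probabilities the model assigned to each token it emitted."""
    return math.exp(sum(math.log(p) for p in token_probs))

# Hypothetical per-token probabilities for one generated answer.
token_probs = [0.92, 0.88, 0.75, 0.81]

t = 0.5  # the threshold from "Answer only if you are > t confident"
score = sequence_confidence(token_probs)

# "Being confident" reduces to a numeric comparison, and the score
# says nothing about whether the answer is factually correct.
print(f"sequence probability = {score:.3f}; answer emitted: {score > t}")
```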
8. The Reasoning Agent
Quote: "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning."
- Frame: Model as a logical thinker
- Projection: The term 'reasoning' projects the cognitive process of logical deduction, inference, and causal thinking onto the process of token prediction. It suggests the model is 'thinking through' a problem (Chain of Thought) in the way a human does. This attributes the conscious state of understanding causal relationships to what is mechanistically a sequence of pattern matches.
- Acknowledgment: Direct description (citing a title/concept).
- Implications: Claiming the AI 'reasons' is perhaps the most dangerous projection. It suggests the system understands why an answer is correct. If users believe the system 'reasons,' they are more likely to trust it in novel, high-stakes situations where pattern matching fails. It conceals the fragility of the system: it can produce the 'steps' of reasoning without having the internal comprehension of the logic, leading to 'hallucinated reasoning' that looks plausible but is nonsense.
Task 2: Source-Target Mapping
About this task
For each key metaphor identified in Task 1, this section provides a detailed structure-mapping analysis. The goal is to examine how the relational structure of a familiar "source domain" (the concrete concept we understand) is projected onto a less familiar "target domain" (the AI system). By restating each quote and analyzing the mapping carefully, we can see precisely what assumptions the metaphor invites and what it conceals.
Mapping 1: Student / Educational Context → Optimization / Loss Minimization
Quote: "Like students facing hard exam questions, large language models sometimes guess when uncertain..."
- Source Domain: Student / Educational Context
- Target Domain: Optimization / Loss Minimization
- Mapping: Maps the human student (conscious, stressed, strategic) onto the AI model. Maps the exam (social assessment) onto the benchmark evaluation. Maps the internal feeling of uncertainty onto statistical entropy. Maps the conscious choice to guess onto probabilistic token selection.
- What Is Concealed: Conceals the lack of conscious experience in the model. A student feels the pressure of the exam; the model simply executes an algorithm. It hides the fact that 'guessing' is a mathematical inevitability when probability mass is distributed, not a strategic choice to improve a grade. It obscures the mechanical nature of the 'exam' as a static dataset.
Mapping 2: Rational Actor / Game Theory → Model Training Dynamics
Quote: "The test-taker knows that the rubric is binary but is not told the correct answers."
- Source Domain: Rational Actor / Game Theory
- Target Domain: Model Training Dynamics
- Mapping: Maps the rational agent (who reads rules and strategizes) onto the model. Maps the 'knowledge' of the rubric onto the implicit incentives provided by the loss function. Assumes the model holds a mental representation of the evaluation criteria.
- What Is Concealed: Conceals that the model doesn't 'know' anything. It hides the mechanism of gradient descent: the model doesn't read the rubric; the rubric (via the loss function) kills off the weights that don't comply. This mapping suggests a level of autonomy and comprehension that simply does not exist.
Mapping 3: Social Deception / Psychology → Text Generation / Hallucination
Quote: "...even bluff on written exams, submitting plausible answers in which they have little confidence."
- Source Domain: Social Deception / Psychology
- Target Domain: Text Generation / Hallucination
- Mapping: Maps the intent to deceive (bluffing) onto the generation of incorrect tokens. Maps the social goal of 'saving face' or 'getting points' onto the mathematical goal of minimizing cross-entropy loss. Implies a Theory of Mind (knowing what the grader accepts).
- What Is Concealed: Conceals the lack of intent. A bluff requires knowing the truth and hiding it. The model outputs the most probable token based on its weights. It conceals the fact that the 'plausibility' of the answer is a feature of the training data (syntactic patterns), not a calculated deception by the model.
Mapping 4: Moral Agency / Communication → Token Output Selection
Quote: "...producing plausible yet incorrect statements instead of admitting uncertainty."
- Source Domain: Moral Agency / Communication
- Target Domain: Token Output Selection
- Mapping: Maps the moral act of confession ('admitting') onto the output of a specific string ('I don't know'). Maps the internal state of doubt onto the statistical distribution of possible tokens.
- What Is Concealed: Conceals the architectural constraints. The model is often forced to generate text by its design (next-token prediction). It hides the fact that 'admitting' requires a training path that incentivizes the 'IDK' token, which is a data problem, not a character flaw. It attributes a refusal to admit to stubbornness rather than weight distribution.
Mapping 5: Life Experience / Experiential Learning → Reinforcement Learning (RLHF)
Quote: "Humans learn the value of expressing uncertainty outside of school, in the school of hard knocks."
- Source Domain: Life Experience / Experiential Learning
- Target Domain: Reinforcement Learning (RLHF)
- Mapping: Maps human maturation and social learning from failure ('school of hard knocks') onto the fine-tuning process of an AI. Suggests the AI accumulates 'wisdom' from experience in the world.
- What Is Concealed: Conceals the labor of human annotators. The 'school of hard knocks' for an AI is actually thousands of low-paid humans rating outputs. This metaphor erases the human labor that creates the 'feedback' signal and makes it seem like the AI is learning autonomously from the 'real world.'
Mapping 6: Psychopathology / Neurology → Prediction Error
Quote: "This error mode is known as 'hallucination'..."
- Source Domain: Psychopathology / Neurology
- Target Domain: Prediction Error
- Mapping: Maps the biological malfunction of perception onto the statistical error of prediction. Suggests the system has a 'normal' state of perceiving reality that has been disrupted.
- What Is Concealed: Conceals the fact that the model never perceives reality. It is a text-processing engine, not a mind. It hides the 'grounding problem': the model has no referent to the real world, only to other words. Calling it hallucination implies it usually 'sees' the truth, which is false; it only ever sees correlations.
Mapping 7: Epistemic State / Justified Belief → Information Retrieval / Pattern Matching
Quote: "If you know, just respond with DD-MM."
- Source Domain: Epistemic State / Justified Belief
- Target Domain: Information Retrieval / Pattern Matching
- Mapping: Maps the conscious state of having knowledge ('knowing') onto the successful retrieval of a pattern. Implies that the model can introspect to check if it 'knows' a fact before answering.
- What Is Concealed: Conceals the probabilistic nature of the output. The model doesn't 'know' the birthday; it calculates that '03-07' has a high probability following the prompt tokens. It hides the distinction between knowing a fact and statistically associating a date with a name.
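A toy sketch of that distinction, using entirely invented co-occurrence counts, is given below: the 'known' birthday is just the argmax of a conditional distribution estimated from how often date strings co-occur with a name, and no verification step appears anywhere in the loop.

```python
from collections import Counter

# Invented co-occurrence counts between one name and candidate date
# strings, standing in for patterns absorbed from training text.
cooccurrence = Counter({"03-07": 7, "15-11": 3, "01-01": 2})

total = sum(cooccurrence.values())
conditional = {date: count / total for date, count in cooccurrence.items()}

# The "answer" is the argmax of an estimated conditional distribution;
# there is no lookup of the actual birthday and no fact-checking step.
best_date, best_prob = max(conditional.items(), key=lambda kv: kv[1])
print(f"most probable completion: {best_date} (p = {best_prob:.2f})")
```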
Task 3: Explanation Audit (The Rhetorical Framing of "Why" vs. "How")
About this task
This section audits the text's explanatory strategy, focusing on a critical distinction: the slippage between "how" and "why." Based on Robert Brown's typology of explanation, this analysis identifies whether the text explains AI mechanistically (a functional "how it works") or agentially (an intentional "why it wants something"). The core of this task is to expose how this "illusion of mind" is constructed by the rhetorical framing of the explanation itself, and what impact this has on the audience's perception of AI agency.
Explanation 1
Quote: "Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty."
- Explanation Types:
  - Intentional: Refers to goals or purposes and presupposes deliberate design, used when the purpose of an act is puzzling
  - Reason-Based: Gives the agent's rationale or argument for acting, which entails intentionality and extends it by specifying justification
- Analysis (Why vs. How Slippage): This is a classic 'why' explanation framed as a rational choice. It explains the AI's behavior (hallucination) by attributing a motive: the desire to pass the exam (maximize reward) leads to the choice of guessing. It creates a narrative of a rational agent weighing the costs of 'admitting uncertainty' against the potential gain of 'guessing.' This obscures the 'how': the mechanistic reality that the model minimizes a loss function and, if the distribution over tokens is flat (uncertainty), the sampling mechanism simply picks one. There is no 'student' making a choice; there is only a math equation resolving variables.
- Consciousness Claims Analysis: This passage is saturated with consciousness claims. Verbs/Phrases: 'guess,' 'uncertain,' 'admitting uncertainty.' The text treats the AI as 'knowing' its own state of uncertainty and making a conscious decision to suppress it ('instead of admitting'). This is a 'curse of knowledge' dynamic: the authors know the answer is unknown to the model (in terms of training data presence), and they project this 'unknown-ness' as a conscious feeling of uncertainty within the model. The actual mechanistic process is that the probability distribution over the next token has high entropy (flatness), meaning no single token is clearly the winner. The sampling algorithm (e.g., top-p or temperature sampling) then stochastically selects a token. The model does not 'refuse to admit'; the weights simply didn't propagate the 'I don't know' token to the top of the list (see the sampling sketch after this explanation).
- Rhetorical Impact: By framing the AI as a struggling student, the authors engender empathy and normalize the error. We all guess on exams. It makes the AI seem human, relatable, and intelligent-but-flawed. This increases trust in the system's potential (it just needs to study harder!) while masking the fundamental architectural limitation that it has no ground truth verification. It shifts the risk perception from 'defective product' to 'imperfect student,' reducing the demand for strict liability.
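The mechanistic account in the consciousness-claims analysis above can be written out directly. In this minimal Python sketch (the candidate strings, scores, and temperature are invented for illustration), a near-flat distribution over continuations is sampled stochastically: the 'guess' is whatever the sampler returns, and the abstention string simply never rises to the top.

```python
import math
import random

# A near-flat (high-entropy) distribution over candidate continuations,
# as arises when the training data pins down no single answer.
candidates = ["1975", "1982", "1990", "I don't know"]
scores = [1.02, 1.00, 0.98, 0.20]  # invented; the abstention string ranks low

def softmax(xs, temperature=1.0):
    exps = [math.exp(x / temperature) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax(scores, temperature=1.0)
entropy = -sum(p * math.log(p) for p in probs)

# "Guessing" is just drawing from this distribution.
answer = random.choices(candidates, weights=probs, k=1)[0]
print(f"entropy = {entropy:.2f} nats; sampled answer: {answer!r}")
```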
Explanation 2
Quote: "We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty..."
- Explanation Types:
  - Functional: Explains a behavior by its role in a self-regulating system that persists via feedback, independent of conscious design
  - Intentional: Refers to goals or purposes and presupposes deliberate design, used when the purpose of an act is puzzling
- Analysis (Why vs. How Slippage): This is a hybrid explanation. It starts functionally (training procedures cause behavior) but slides into intentional language ('reward guessing over acknowledging'). It frames the model as an agent responding to incentives. While more grounded than the 'student' metaphor, it still implies the model is an active participant seeking rewards, rather than a passive set of weights being adjusted by a gradient. It emphasizes the 'why' of the behavior (incentives) rather than the 'how' (dataset composition and loss calculation).
- Consciousness Claims Analysis: The phrase 'acknowledging uncertainty' attributes a conscious, communicative intent to the model. 'Acknowledging' is a speech act requiring self-awareness. The text implies the model possesses the 'truth' of its own uncertainty but is incentivized to suppress it. MECHANISM: The system optimizes a cross-entropy loss function. If the training data (evaluation set) contains only factual answers and no 'I don't know' examples, the gradients will push the weights to produce something rather than nothing. The model is not 'choosing' to guess; the mathematical function is converging on the path of least error relative to the training set. There is no conscious 'knowing' of uncertainty to be acknowledged (a worked scoring example follows this explanation).
- Rhetorical Impact: This framing suggests that the problem is purely social/structural (bad tests), not technical. It implies the AI is 'good' but 'misguided' by bad teachers (evaluations). This shifts responsibility onto the benchmark creators and away from the model developers who built a system capable of such confident fabrication. It implies that fixing the tests will fix the 'mind' of the AI.
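The incentive claim in the quote reduces to arithmetic. Under a binary grader that awards one point for a correct answer and zero otherwise (the scheme the passage describes), any guess with a nonzero chance of being right has a higher expected score than abstaining. The sketch below is illustrative; the probability value is invented.

```python
def expected_score_binary(p_correct, abstain):
    """Expected score under 0/1 grading: abstaining earns 0,
    guessing earns 1 with probability p_correct."""
    return 0.0 if abstain else p_correct

p = 0.2  # invented: the guess is right only 20% of the time
print("guess:  ", expected_score_binary(p, abstain=False))  # 0.2
print("abstain:", expected_score_binary(p, abstain=True))   # 0.0
# A training or evaluation signal built on this grader favors weights
# that always produce an answer over weights that ever abstain.
```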
Explanation 3
Quote: "The test-taker knows that the rubric is binary but is not told the correct answers."
- Explanation Types:
  - Intentional: Refers to goals or purposes and presupposes deliberate design, used when the purpose of an act is puzzling
  - Theoretical: Embeds behavior in a deductive or model-based framework, may invoke unobservable mechanisms
- Analysis (Why vs. How Slippage): This creates a theoretical model (Game Theory) where the AI is a rational 'test-taker.' It explains the behavior by imputing knowledge ('knows the rubric') and a lack of knowledge ('not told correct answers'). This is a purely agential 'why': the AI acts this way because it is maximizing its utility given its knowledge state. It totally obscures the 'how': the model has no concept of a rubric. It simply has weights that have been shaped by millions of updates.
- Consciousness Claims Analysis: This is a profound attribution of consciousness: 'The test-taker knows.' It claims the system holds a justified belief about the evaluation system itself. This is the 'curse of knowledge' par excellence: the authors know the rubric, so they model the AI as knowing it too. MECHANISM: The model has been fine-tuned on data where providing an answer yielded a lower loss than abstaining. This pattern is encoded in the weights. The model does not 'know' the rubric; it simply executes the pattern that was reinforced. Attributing 'knowledge of the rubric' attributes a metacognitive awareness of the social structure of the test, which is a massive overestimation of the system's capacity.
- Rhetorical Impact: This makes the AI seem like a sophisticated strategic partner. It implies the AI is capable of understanding rules and incentives. If users believe the AI 'knows the rubric,' they might also believe it 'knows the laws' or 'knows ethical boundaries.' It creates a dangerous expectation of compliance based on understanding, when in reality the system only offers compliance based on statistical correlation.
Explanation 4
Quote: "Specifically, we say a query c is answered in the training data if there is a training example... and unanswered otherwise... Of course, by memorizing ac for answered queries, one can achieve perfect accuracy..."
- Explanation Types:
  - Functional: Explains a behavior by its role in a self-regulating system that persists via feedback, independent of conscious design
  - Theoretical: Embeds behavior in a deductive or model-based framework, may invoke unobservable mechanisms
- Analysis (Why vs. How Slippage): This explanation moves closer to the mechanistic. It explains error rates via 'memorizing' (Functional/Theoretical). However, even 'memorizing' is a cognitive metaphor for 'overfitting' or 'encoding.' It frames the 'how' more clearly (presence/absence in training data) but still uses mentalistic shorthand. It explains the 'why' of the error as a data availability issue.
- Consciousness Claims Analysis: The use of 'memorizing' suggests a cognitive act of retention. While less charged than 'knowing,' it still implies the system 'holds' the fact. MECHANISM: 'Memorizing' in an LLM means the weights have been adjusted such that the specific sequence of prompt tokens triggers the specific sequence of answer tokens with near-1.0 probability. It is an overfitting of the function to a specific data point. The text avoids the most egregious consciousness claims here but still relies on cognitive metaphors ('answered,' 'unanswered') to describe data distribution.
- Rhetorical Impact: This is the most scientifically grounded passage, and it increases credibility. By using statistical terms, the authors validate their earlier metaphors. It makes the 'hallucination' problem seem solvable through better data engineering (fixing the singleton rate). It reassures the audience that the 'student' (AI) has a measurable learning process.
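The 'singleton rate' mentioned above has a simple Good-Turing reading: the share of observations that occur exactly once estimates how much probability mass sits on barely-covered facts, which is the kind of quantity the paper uses to bound error rates. The sketch below is illustrative only; the counts are invented and the paper's exact definition may differ.

```python
from collections import Counter

# Invented counts of how often each fact appears in a toy training corpus.
fact_counts = Counter({
    "rare_birthday": 1,        # a singleton: seen exactly once
    "obscure_paper_venue": 1,  # another singleton
    "paris_capital": 120,
    "water_boiling_point": 85,
})

n_observations = sum(fact_counts.values())
n_singletons = sum(1 for c in fact_counts.values() if c == 1)

# Good-Turing missing-mass estimate: the probability mass expected to
# fall on facts the corpus has essentially no repeated evidence for.
singleton_rate = n_singletons / n_observations
print(f"estimated singleton rate = {singleton_rate:.3f}")
```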
Explanation 5
Quote: "If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures."
- Explanation Types:
  - Empirical Generalization (Law): Subsumes events under timeless statistical regularities, emphasizes non-temporal associations rather than dated processes
  - Functional: Explains a behavior by its role in a self-regulating system that persists via feedback, independent of conscious design
- Analysis (Why vs. How Slippage): This is a strong mechanistic explanation. It frames the 'why' as 'natural statistical pressures' (Law). It removes the agent (mostly) and focuses on the inevitability of the output given the inputs. This obscures the human agency in choosing to release a model that cannot distinguish fact from fiction.
- Consciousness Claims Analysis: This passage is relatively clean of consciousness claims, except for the phrase 'distinguished from facts.' Who is doing the distinguishing? The model? The text implies the model lacks the capacity to 'distinguish' (a cognitive act). MECHANISM: The vectors for factual tokens and incorrect tokens are not linearly separable in the model's high-dimensional space. The loss function treats them similarly if they appear in similar contexts. The model doesn't 'fail to distinguish'; the mathematical representation is insufficient to separate the classes (a numerical sketch of this error floor follows the explanation).
- Rhetorical Impact: This framing naturalizes the error. Hallucinations are the result of 'natural statistical pressures,' like gravity or evolution. This absolves the developers of blame. It suggests that errors are a law of nature (statistics) rather than a flaw in product design. It tells the audience: 'Don't blame us, blame the math.' This lowers expectations for reliability and mitigates liability risks.
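Mechanistically, 'cannot be distinguished' is a claim about class overlap, and the resulting error floor can be computed. In the sketch below (the two Gaussian score distributions and their parameters are invented for illustration), even the optimal decision threshold misclassifies a fixed fraction of cases; no further 'effort' by a classifier removes that error.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Invented 1-D "plausibility scores" for valid vs. invalid statements.
mu_valid, mu_invalid, sigma = 1.0, 0.0, 1.0

# With equal priors and equal variances, the best possible threshold
# lies halfway between the two class means.
threshold = (mu_valid + mu_invalid) / 2.0

# Error of the optimal rule: valid items scored below the threshold
# plus invalid items scored above it.
p_error = (0.5 * normal_cdf(threshold, mu_valid, sigma)
           + 0.5 * (1.0 - normal_cdf(threshold, mu_invalid, sigma)))
print(f"irreducible error rate = {p_error:.1%}")
```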
Task 4: AI Literacy in Practice - Reframing Anthropomorphic Language
About this task
This section proposes alternative language for key anthropomorphic phrases, offering more mechanistic and precise framings that better reflect the actual computational processes involved. Each reframing attempts to strip away the projections of intention, consciousness, or agency that are embedded in the original language.
| Original Anthropomorphic Frame | Mechanistic Reframing | Technical Reality Check |
|---|---|---|
| Like students facing hard exam questions, large language models sometimes guess when uncertain | Like a statistical model processing low-frequency inputs, the system generates tokens stochastically when the probability distribution over the vocabulary is high-entropy. | The AI 'guesses' -> 'The model samples from a flat probability distribution where no single token has a dominant likelihood, resulting in random selection.' |
| producing plausible yet incorrect statements instead of admitting uncertainty | generating syntactically coherent but factually incorrect sequences instead of outputting an abstention token. | The AI 'admits uncertainty' -> 'The model outputs a specific token (e.g., 'IDK') because that token has the highest probability weight in the current context.' |
| The test-taker knows that the rubric is binary but is not told the correct answers. | The optimization process minimizes a binary loss function, but the training data does not contain the target output for this specific input. | The AI 'knows the rubric' -> 'The model's weights have been shaped by a loss function that penalized non-response, incentivizing generation over abstention.' |
| Humans learn the value of expressing uncertainty outside of school, in the school of hard knocks. | Human feedback mechanisms (RLHF) can penalize confident errors, adjusting the model's weights to favor abstention in uncertain contexts. | The AI 'learns from hard knocks' -> 'The model's parameters are updated via gradient descent based on a reward signal derived from human annotation of error.' |
| If you know, just respond with DD-MM. | If the retrieval process yields a high-probability completion, output the date format DD-MM. | The AI 'knows' -> 'The model's attention mechanism identifies a strong correlation between the subject token and a specific date token.' |
| ...even bluff on written exams, submitting plausible answers in which they have little confidence. | ...generate incorrect text on evaluation benchmarks, producing high-perplexity sequences that mimic the syntax of correct answers. | The AI 'bluffs' -> 'The model generates a sequence that has high syntactic probability (looks like an answer) but low semantic accuracy (is wrong) due to lack of specific training data.' |
| Answer only if you are > t confident | Generate an output only if the probability of the top token sequence exceeds threshold t. | The AI is 'confident' -> 'The model calculates a probability score (logit) for the generated sequence that is numerically higher than value t.' |
| This change may steer the field toward more trustworthy AI systems. | This change may steer the field toward systems with higher statistical reliability and lower error rates. | The AI is 'trustworthy' -> 'The system produces outputs that match verification datasets with greater frequency.' |
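The last two rows of the table can be tied together with the arithmetic behind the '> t confident' instruction. Assuming a grading scheme in which a correct answer earns 1 point, abstention earns 0, and a wrong answer costs t/(1-t) points (the penalty value is an assumption used here to make the quoted threshold rule concrete), answering has positive expected value exactly when the model's computed probability of being correct exceeds t; 'deciding whether to admit uncertainty' is this inequality and nothing more.

```python
def expected_score(p_correct, t, abstain):
    """Expected score under an assumed confidence-target grader:
    correct = +1, abstain = 0, wrong = -t / (1 - t)."""
    if abstain:
        return 0.0
    penalty = t / (1.0 - t)
    return p_correct - (1.0 - p_correct) * penalty

t = 0.75
for p in (0.60, 0.75, 0.90):  # invented "confidence" values
    answer = expected_score(p, t, abstain=False)
    hold = expected_score(p, t, abstain=True)
    print(f"p = {p:.2f}: answer = {answer:+.2f}, abstain = {hold:+.2f}")
# Answering beats abstaining exactly when p > t; "admitting uncertainty"
# is what the policy does whenever this inequality fails.
```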
Task 5: Critical Observations - Structural Patterns
Agency Slippage
The text demonstrates a profound and strategic oscillation between mechanical and agential framing, functioning to both excuse the system's failures and hype its sophistication. The slippage occurs most dramatically in the transition from the Introduction (agential) to Section 3 (mechanical) and back to the Discussion (agential). In the Introduction, the AI is a 'student' who 'guesses' and 'bluffs': high-agency terms that imply the system has a Theory of Mind and is making strategic choices to deceive. This establishes the AI as a 'knower' capable of intent. However, when explaining why the errors happen (Section 3), the text retreats to rigorous mechanical language: 'natural statistical pressures,' 'density estimation,' and 'missing mass.' Here, agency is stripped away; the AI is a victim of statistics.
Critically, agency is removed from human actors. Phrases like 'training data... inevitably contains errors' and 'hallucinations... arise through natural statistical pressures' obscure the specific corporations (OpenAI, Google) and engineers who curated the data and designed the loss functions. The 'epidemic' of bad evaluations is treated as an environmental condition, not a choice by benchmark designers. The text uses the 'student' metaphor to attribute the agency of learning to the model ('Humans learn... language models are evaluated'), while the actual 'learning' is a passive weight update performed by engineers. The 'curse of knowledge' is the pivot point: the authors know the answer and know the rubric, so they project this knowledge onto the system, treating its failure to match their knowledge as a 'choice' to bluff, rather than a mechanical failure to represent the data.
Metaphor-Driven Trust Inflation
The dominant 'student' and 'exam' metaphors construct a specific type of authority that is deeply disarming. By framing the AI as a student, the text invites the audience to view the system with pedagogical patience rather than product liability. We trust students who make mistakes because we know they are learning; we do not trust calculators that make mistakes. This metaphor shifts the framework from performance-based trust (is it accurate?) to relation-based trust (is it trying its best?).
Consciousness language plays a critical role here. Claims that the AI 'knows' the rubric or 'admits' uncertainty suggest a moral interiority. If the AI can 'admit' things, it has a conscience. This makes it a candidate for trust in a way a toaster is not. The text manages failure by anthropomorphizing it: 'bluffing' is a human social strategy, implying intelligence. A machine that 'bluffs' is smart; a machine that 'fails to retrieve data' is broken. This preserves the authority of the system even in its failure modes. By suggesting the AI is 'trustworthy' (a moral trait) rather than just 'accurate' (a statistical trait), the text prepares the user to forgive errors as 'growing pains' of a developing mind, maintaining corporate credibility despite product unreliability.
Obscured Mechanics
The anthropomorphic 'student' metaphor conceals the industrial and economic realities of LLM production.
- Technical Realities: The 'intuition' of the student hides the brute-force nature of the mechanism: trillions of floating-point operations minimizing a loss function. It hides the fact that 'knowing' is actually just 'overfitting' or 'high-probability correlation.'
- Material Realities: The metaphor of 'learning' mystifies the energy consumption. A student learning requires a sandwich; an LLM 'learning' requires a data center and massive electricity usage.
- Labor Realities: The 'School of Hard Knocks' metaphor completely erases the RLHF workers. The model doesn't 'learn from life'; it updates weights based on feedback from underpaid human annotators in Kenya or the Philippines. Their labor is reframed as the AI's autonomous 'experience.'
- Economic Realities: The framing of 'hallucination' as a 'test-taking strategy' obscures the commercial imperative. Companies release these models not because they are 'students' needing to pass, but because they are products designed to dominate a market. The 'choice' to guess is actually a corporate choice to prioritize 'always-on' responsiveness over accuracy. Agentless constructions like 'benchmarks dominate leaderboards' hide the fact that specific companies (like OpenAI) create and fund these leaderboards to define the playing field.
Context Sensitivity
The distribution of metaphor in this text is highly strategic. The Introduction and Conclusion are dense with high-intensity anthropomorphism ('students,' 'bluffing,' 'trustworthy,' 'admitting'). This sets the conceptual frame for the reader. However, Section 3 ('Pretraining Errors') and Section 4 ('Post-training') shift into a high-technical register ('density estimation,' 'Is-It-Valid binary classification,' 'Vapnik-Chervonenkis dimension').
This creates a 'bait-and-switch.' The authors establish credibility through the rigorous math in the middle, which effectively 'proves' the metaphors in the intro. The discovery of a 'missing mass' statistical bound is treated as proof that the 'student' must 'guess.' Capabilities are described agentially (the model 'knows,' 'reasons'), while limitations are described mechanically (errors arise from 'statistical pressures'). When the model works, it's a genius student; when it fails, it's a victim of statistics. This asymmetry protects the 'mind' illusion from falsification: the 'mind' gets credit for the hits, while the 'math' takes the blame for the misses.
Conclusion: What This Analysis Reveals
The text constructs a cohesive metaphorical system I call 'AI AS STRATEGIC STUDENT.' This pattern relies on three interconnected metaphors: the AI as a Learner (student), the AI as a Strategist (bluffer/test-taker), and the evaluation as a Social Assessment (exam/rubric). The foundational, load-bearing pattern is the Consciousness Projection that the AI 'knows' its own epistemic state. For the AI to 'bluff' (Pattern 2), it must first 'know' that it doesn't know the answer (Pattern 1) and 'decide' to hide that fact to satisfy the rubric (Pattern 3). This entire edifice rests on the unproven assumption that high-entropy probability distributions constitute a conscious state of 'uncertainty' that the model can introspectively access. If we remove the 'knower' assumption, the 'strategic student' collapses into a 'miscalibrated function,' and the narrative of agency dissolves.
Mechanism of the Illusion:
The 'illusion of mind' is constructed through a subtle temporal and rhetorical sleight-of-hand. The text begins by establishing the 'knower' frame immediately with the 'student' analogy. This exploits the audience's familiarity with education: we all know what it feels like to guess on a test. Once this empathy is established, the authors introduce the 'Curse of Knowledge' dynamic: because they (the authors) know the difference between a fact and a hallucination, and because they know the rubric, they write as if the model shares this distinction but chooses to ignore it. The illusion is solidified by the 'How/Why' slippage: the mathematical 'how' (Good-Turing estimators, missing mass) is used to justify the psychological 'why' (the model guesses because it's forced to). The audience, intimidated by the math in Section 3, accepts the psychological narrative in Section 4 as a proven conclusion, failing to see that the math only proves statistical inevitability, not intent.
Material Stakes:
The shift from 'processing' to 'knowing' has profound material consequences.
Regulatory/Legal: If the AI is framed as a 'student' who 'guesses,' liability becomes ambiguous. We do not sue students for getting answers wrong; we correct them. This framing benefits AI manufacturers (like OpenAI, the authors' employer) by positioning 'hallucination' as a pedagogical failure (needs more training) rather than a product defect (strict liability). If the text strictly used 'processing' language, regulators might classify hallucination as 'unfit for purpose,' invoking consumer protection laws.
Epistemic: By claiming the AI 'bluffs' (implying it knows the truth but hides it), users are encouraged to trust that the truth is 'in there' somewhere, accessible with better prompting. This leads to dangerous epistemic practices where users rely on systems for factual retrieval, believing the 'trustworthy' agent, when in reality the system may have no record of the fact at all.
Economic: The 'trustworthy' frame inflates the value of the product. An 'agent' that can be 'trusted' is worth billions; a 'text generator' that 'statistically correlates' is a commodity tool. This metaphor sustains the investment bubble by promising a future of autonomous agents (AGI) rather than just better autocompletion.
AI Literacy as Counter-Practice:
Practicing critical literacy in this domain means systematically stripping the 'student' costume off the algorithm. Reframing 'the model guesses' to 'the system samples from a high-entropy distribution' is an act of resistance against the anthropomorphic hype that fuels the AI bubble. It forces a confrontation with the limits of the technology. Replacing 'admitting uncertainty' with 'outputting an IDK token' reveals that 'honesty' in AI is just data engineering, not character development. This counter-practice directly threatens the 'Regulatory Shield' identified in P3: if the system is just code executing a probabilistic function, the manufacturer is responsible for its outputs. Adopting this precision would require journals to demand 'epistemic audits' of claims, rejecting papers that attribute mental states to math. Resistance will come from the industry (who benefit from the 'agent' narrative) and from the media (who thrive on the sci-fi narrative).
Path Forward
The discourse in this domain faces a fork in the road.
Future A (Status Quo/Anthropomorphic): We continue to use 'knows,' 'understands,' and 'hallucinates.' This makes AI accessible and intuitive but entrenches a false model of risk. It serves the industry by maintaining the 'illusion of mind' necessary for AGI investment but leaves the public vulnerable to manipulation and misplaced trust.
Future B (Mechanistic Precision): We shift to 'processes,' 'correlates,' and 'confabulates.' This lowers the temperature. It alienates lay readers but clarifies liability. It makes 'AI safety' a matter of engineering tolerance, not 'teaching' values.
Institutional Shifts: To support Future B, we need: 1) Journal Standards that flag 'consciousness verbs' as requiring justification; 2) Regulatory Definitions that define AI errors as product defects, not 'hallucinations'; and 3) Educational Curricula that teach 'how LLMs work' via statistics, not biology.
This analysis suggests that the current vocabulary is not just sloppy; it is a load-bearing wall for the permissive economic and regulatory environment that currently surrounds AI.