
📊 Why Language Models Hallucinate

About

This document presents a Critical Discourse Analysis focused on AI literacy, specifically targeting the role of metaphor and anthropomorphism in shaping public and professional understanding of generative AI. The analysis is guided by a prompt that draws from cognitive linguistics (metaphor structure-mapping), the philosophy of social science (Robert Brown's typology of explanation), and accountability analysis.

All findings and summaries below were generated from detailed system instructions provided to a large language model and should be read critically as interpretive outputs, not guarantees of factual accuracy or authorial intent.


Task 1: Metaphor and Anthropomorphism Audit

About this task

For each of the major metaphorical patterns identified, this audit examines the specific language used, the frame through which the AI is being conceptualized, what human qualities are being projected onto the system, whether the metaphor is explicitly acknowledged or presented as direct description, andโ€”most criticallyโ€”what implications this framing has for trust, understanding, and policy perception.

V3 Enhancement: Each metaphor now includes an accountability analysis.

1. The Student taking an Exam

Quote: "Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty."

  • Frame: Model as student / Evaluation as exam
  • Projection: This metaphor projects the entire sociotechnical apparatus of human education onto statistical data processing. It suggests the model possesses an internal psychological state of 'uncertainty' that it consciously chooses to suppress in favor of 'guessing' to maximize a grade. It implies the system has agency, a desire to succeed, and the capacity for meta-cognition (knowing that it does not know). By framing the AI as a 'student,' the text invokes a developmental trajectory, suggesting that errors are part of a learning curve rather than permanent features of a probabilistic architecture.
  • Acknowledgment: Explicitly Acknowledged (The text opens with the simile marker 'Like students facing hard exam questions,' explicitly setting up the analogy before extending it into the structural argument of the paper.)
  • Implications: Framing the AI as a student explicitly shifts the burden of performance onto the system's 'effort' or 'learning' rather than the manufacturer's design. If an AI is a student, errors are 'learning opportunities' or result from 'bad testing,' rather than product defects. This heavily inflates the perceived sophistication of the system, suggesting it has the cognitive architecture to 'take a test' rather than simply pattern-match against a validation set. It risks policy environments treating AI development as pedagogy rather than software engineering.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The construction 'producing plausible yet incorrect statements' and 'instead of admitting uncertainty' places the agency on the model. The 'evaluation procedures' are described as the active agent that 'rewards guessing,' obscuring the human researchers (including the authors at OpenAI) who designed the loss functions, selected the training data, and established the reinforcement learning protocols that enforce this behavior.

2. The Strategic Bluffer

Quote: "When uncertain, students may guess on multiple-choice exams and even bluff on written exams, submitting plausible answers in which they have little confidence... Bluffs are often overconfident and specific"

  • Frame: Probabilistic error as intentional deception
  • Projection: This extends the student metaphor to attribute specific intent: the intent to deceive ('bluff') to save face or gain points. It projects a 'theory of mind' onto the model, suggesting it understands the social game of testing and chooses a deceptive strategy. It conflates low-probability token generation (mechanistic) with the complex social and psychological act of bluffing (agential), which requires knowing the truth, knowing the audience doesn't know, and intending to mislead.
  • Acknowledgment: Direct (Unacknowledged) (While the paragraph starts with an analogy to students, the text pivots to describing model outputs directly as 'Bluffs' without qualification in the subsequent sentence: 'Bluffs are often overconfident and specific'.)
  • Implications: Calling a hallucination a 'bluff' implies a failure of character or alignment ('honesty') rather than a failure of statistical grounding. It suggests the model 'knows' the truth but hides it. This creates unwarranted trust that if we simply 'align' the model (teach it to be honest), the problem vanishes. It obscures the risk that the model effectively 'believes' its own hallucinations because it has no ground truth access, only token probabilities.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The text treats the bluffing behavior as an emergent property of the 'test-taking' dynamic. It obscures the specific engineering choices in RLHF (Reinforcement Learning from Human Feedback) where human annotators may have positively reinforced confident-sounding answers, thereby explicitly training the model to 'bluff.' The agency is displaced onto the 'school of hard knocks' vs. 'exams' dichotomy.

3. Admitting Uncertainty

Quote: "producing plausible yet incorrect statements instead of admitting uncertainty."

  • Frame: Outputting low-confidence scores as 'confession'
  • Projection: This attributes a conscious epistemic state ('uncertainty') and a communicative intent ('admitting') to the system. It implies the model possesses a private, internal state of knowledge where it 'knows' it is unsure, and faces a choice of whether to reveal that state. Mechanistically, the model calculates probability distributions; it does not 'feel' uncertain nor does it have a self-concept to 'admit' anything to.
  • Acknowledgment: Direct (Unacknowledged) (The phrase is used literally to describe the system's failure mode. There are no scare quotes around 'admitting' or 'uncertainty' in this context.)
  • Implications: This is a critical epistemic distortion. If users believe the model can 'admit' uncertainty, they will assume that when it doesn't admit it, the model is 'certain' (and therefore correct). This dangerously inflates trust in the model's confident errors. It treats the absence of an 'I don't know' token as a guarantee of factual accuracy, ignoring that the model can be statistically confident about a hallucination.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The construction suggests the model refuses to admit uncertainty. This obscures the designers' decision to suppress refusal tokens (like 'I don't know') in favor of helpfulness/completion during fine-tuning. The authors (OpenAI researchers) are analyzing a behavior that their organization's engineering practices likely instilled.

4. Optimized Test-Takers

Quote: "language models are optimized to be good test-takers, and guessing when uncertain improves test performance."

  • Frame: Optimization as studying/skill-acquisition
  • Projection: This projects the goal-oriented behavior of a human maximizing a GPA onto the mathematical minimization of loss functions. It implies the model has a desire to be 'good' at the test. While 'optimized' is a technical term, linking it to 'good test-takers' anthropomorphizes the result, suggesting the model is gaming the system rather than simply descending a gradient defined by the developers.
  • Acknowledgment: Direct (Unacknowledged) (The statement is presented as a factual description of the state of the art. The metaphor of the 'test-taker' is fused with the technical term 'optimized'.)
  • Implications: This framing normalizes the disconnect between benchmarks and real-world utility. By framing models as 'test-takers,' it trivializes the failure modes as 'gaming the stats' rather than fundamental reliability issues. It suggests the solution is simply 'better tests' (pedagogical reform) rather than questioning whether the statistical architecture can ever be truthful.

Accountability Analysis:

  • Actor Visibility: Partial (some attribution)
  • Analysis: Passive voice ('are optimized') hides the optimizer. The text identifies 'benchmarks' and 'evaluation procedures' as the driving forces, rather than the specific corporations (OpenAI, Google, Meta) and research leads who decided to use those benchmarks as the primary signal for deployment readiness.

5. Hallucination as Epidemic

Quote: "This 'epidemic' of penalizing uncertain responses can only be addressed through a socio-technical mitigation"

  • Frame: Engineering choice as public health crisis
  • Projection: Using the metaphor of an 'epidemic' treats a deliberate design choice (penalizing 'I don't know' responses) as a contagion or natural disaster that has befallen the field. It removes the element of choice. An epidemic spreads largely beyond human control; engineering metrics are chosen by specific actors.
  • Acknowledgment: Explicitly Acknowledged (The word 'epidemic' is placed in scare quotes, indicating the authors recognize it is a metaphorical application of the term.)
  • Implications: This biological/viral metaphor diffuses responsibility. It suggests that 'bad evaluations' are spreading like a virus, rather than being adopted by specific institutions. It positions the authors (and their company, OpenAI) as doctors fighting a disease, rather than engineers who helped design the environment in which this 'disease' thrives.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The 'epidemic' is the subject. The actors who are 'penalizing uncertain responses' (the creators of the benchmarks and the model trainers who optimize for them) are not named. The 'field' is the implied victim/patient.

6. Intrinsic vs. Extrinsic Hallucination

Quote: "distinguish intrinsic hallucinations that contradict the userโ€™s prompt... [from] extrinsic hallucinations, which contradict the training data or external reality."

  • Frame: Data discrepancy as cognitive disorder
  • Projection: Retaining the psychiatric term 'hallucination' projects a mind-body dualism. 'Intrinsic' implies an internal mental conflict, while 'extrinsic' implies a break with reality. In a machine, these are simply data processing errors: one contradicts the context window (prompt), the other contradicts the weights (training data). There is no 'internal' or 'external' reality for the model, only tokens.
  • Acknowledgment: Direct (Unacknowledged) (The terms are used as standard technical taxonomy labels (citing Maynez et al., 2020) without qualifying the metaphorical baggage of 'hallucination' in this specific context.)
  • Implications: This cements the 'mind' metaphor. By classifying hallucinations into types, it mimics psychiatric diagnosis. It implies the model has a 'grasp' of reality that it is failing to maintain. It obscures the fact that the model has no access to 'external reality' at all; it only has statistical correlations between tokens.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The agent is the 'hallucination' itself or the model. This taxonomy deflects from the source of the error: the training data curation (human agency) or the architectural limitation (design agency). It treats the error as a pathology of the organism.

7. Trustworthy AI Systems

Quote: "This change may steer the field toward more trustworthy AI systems."

  • Frame: Reliability as moral character
  • Projection: Trustworthiness is a human moral quality involving honesty, integrity, and consistency. Applying it to an AI system implies the system can be 'worthy' of a human relationship. It shifts the focus from 'reliable' or 'accurate' (performance metrics) to 'trustworthy' (relational attribute), suggesting the system is a partner.
  • Acknowledgment: Direct (Unacknowledged) (Used as a standard goal/metric. No hedging.)
  • Implications: This is the ultimate goal of the anthropomorphic project: to establish the AI as a valid social actor. If the system is 'trustworthy,' humans are encouraged to offload critical judgment to it. It obscures the liability question: if a 'trustworthy' system fails, is it a betrayal (social) or a malfunction (legal/product)?

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The 'field' is being steered. The 'systems' become trustworthy. The human corporate actors who define what counts as 'trustworthy' (often defining it as 'safety' or 'alignment' rather than 'truth') are invisible. It obscures the profit motive in branding a product as 'trustworthy.'

8. Learning the Value

Quote: "Humans learn the value of expressing uncertainty outside of school, in the school of hard knocks."

  • Frame: Reinforcement Learning as Life Experience
  • Projection: This explicitly maps the human experience of maturing through social pain ('school of hard knocks') onto the model's training process. It suggests that if we just give the model the right 'life experiences' (RLHF penalties), it will develop the wisdom to be humble. It anthropomorphizes the mathematical penalty term as a 'hard knock' teaching a life lesson.
  • Acknowledgment: Explicitly Acknowledged (The phrase 'school of hard knocks' is a colloquialism/idiom used metaphorically to contrast with formal 'exams' (benchmarks).)
  • Implications: This suggests that AI errors are due to a lack of 'experience' or 'maturity' rather than fundamental limitations. It implies that with enough 'hard knocks' (fine-tuning), the model will attain wisdom. It obscures the fact that the model doesn't 'learn a value'; it just adjusts weights to minimize a penalty score.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The 'school of hard knocks' is an agentless environment. In reality, the 'knocks' are administered by low-wage data annotators following specific instructions written by engineers. This metaphor erases the labor relations of RLHF.

Task 2: Source-Target Mapping

About this task

For each key metaphor identified in Task 1, this section provides a detailed structure-mapping analysis. The goal is to examine how the relational structure of a familiar "source domain" (the concrete concept we understand) is projected onto a less familiar "target domain" (the AI system). By restating each quote and analyzing the mapping carefully, we can see precisely what assumptions the metaphor invites and what it conceals.

Mapping 1: Pedagogy / Student Psychology → Statistical Inference / Token Prediction

Quote: "Like students facing hard exam questions, large language models sometimes guess when uncertain"

  • Source Domain: Pedagogy / Student Psychology
  • Target Domain: Statistical Inference / Token Prediction
  • Mapping: The mapping projects the internal psychological state of a student (anxiety, uncertainty, desire to pass, strategic guessing) onto the statistical operations of a neural network. The 'exam' maps to the evaluation benchmark; the 'grade' maps to the accuracy metric; 'guessing' maps to sampling from a probability distribution where the top token has low probability mass.
  • What Is Concealed: This mapping conceals the total absence of self-awareness in the model. A student knows they are taking a test and cares about the outcome. The model simply executes a matrix multiplication. The metaphor hides the fact that 'guessing' is the only thing the model does: it is always predicting the next token based on probability. There is no distinction in the machine between 'knowing' and 'guessing'; there is only high probability and low probability.
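
To make the concealed mechanism concrete, the sketch below shows what 'guessing' amounts to at the level of a single next-token step: converting logits to probabilities and sampling from them, with entropy as the only machine-level analogue of 'uncertainty.' This is an illustrative toy (the vocabulary and logits are invented), not the paper's method or any particular model's internals.

```python
import math
import random

# Hypothetical vocabulary and logits for one next-token prediction step.
# In a real model these come from a forward pass; here they are invented.
vocab = ["Paris", "Lyon", "Marseille", "I", "don't", "know"]
logits = [2.1, 1.9, 1.8, 0.2, 0.1, 0.1]

# Softmax: convert logits into a probability distribution over the vocabulary.
exps = [math.exp(l) for l in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Shannon entropy (in nats): the only machine-level analogue of "uncertainty".
entropy = -sum(p * math.log(p) for p in probs)

# "Guessing" is just sampling from this distribution -- the same operation
# is performed whether the highest-probability token is true or false.
choice = random.choices(vocab, weights=probs, k=1)[0]

print({token: round(p, 3) for token, p in zip(vocab, probs)})
print(f"entropy (nats): {entropy:.3f}")
print(f"sampled token: {choice}")
```

Whether the sampled token happens to be correct or not, the computation is identical; 'knowing' and 'guessing' name the same operation.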

Mapping 2: Social interaction / Game theory (Poker) → Low-entropy generation of incorrect tokens

Quote: "Bluffs are often overconfident and specific"

  • Source Domain: Social interaction / Game theory (Poker)
  • Target Domain: Low-entropy generation of incorrect tokens
  • Mapping: Maps the human act of intentional deception (pretending to hold a card/fact one does not have) onto the model's generation of high-confidence scores for incorrect tokens. It assumes a duality: the model 'knows' the truth but 'chooses' to present a falsehood with confidence to win the game.
  • What Is Concealed: It conceals the mechanistic reality that 'confidence' in an LLM is merely the log-probability of the next token. High confidence on a hallucination is not a 'bluff'; it is a statistical artifact where the training data created a strong correlation between a context and a false completion. The model cannot 'intend' to deceive because it has no concept of truth or falsehood, only likelihood.
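
A similarly deflationary sketch can show what this concealed 'confidence' is: the summed log-probability of a token sequence. The numbers below are invented for illustration; the point is only that a fluent falsehood can receive a higher score than a correct but low-frequency phrasing, with no intent anywhere in the calculation.

```python
import math

# Hypothetical per-token probabilities assigned by a model to two completions
# of the same prompt. The numbers are invented for illustration.
false_but_fluent = [0.92, 0.88, 0.95, 0.90]   # a confident-sounding fabrication
true_but_rare    = [0.40, 0.35, 0.50, 0.30]   # a correct but low-frequency phrasing

def sequence_log_prob(token_probs):
    """'Confidence' here is nothing more than summed log-probabilities."""
    return sum(math.log(p) for p in token_probs)

print(f"log P(false completion) = {sequence_log_prob(false_but_fluent):.2f}")
print(f"log P(true completion)  = {sequence_log_prob(true_but_rare):.2f}")
# The 'bluff' is simply the sequence with the higher score.
```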

Mapping 3: Interpersonal Communication / Confession → Token generation vs. Rejection sampling

Quote: "producing plausible yet incorrect statements instead of admitting uncertainty"

  • Source Domain: Interpersonal Communication / Confession
  • Target Domain: Token generation vs. Rejection sampling
  • Mapping: Projects the human capacity for introspection and verbal confession onto the output of specific tokens (e.g., 'I don't know'). 'Admitting' implies the system accesses a truth about its own state and chooses to verbalize it. 'Uncertainty' maps to entropy or low log-probs.
  • What Is Concealed: Conceals that 'admitting uncertainty' is just generating the token string 'I don't know' because it was statistically probable in that context (or enforced by RLHF). It hides the fact that the model does not 'feel' uncertain. It also hides the engineering decisions that often punish 'I don't know' responses to make the model seem more 'helpful' or 'smart,' creating the very behavior being criticized.

Mapping 4: Academic Achievement / Skill Acquisition → Hyperparameter tuning / Loss minimization

Quote: "language models are optimized to be good test-takers"

  • Source Domain: Academic Achievement / Skill Acquisition
  • Target Domain: Hyperparameter tuning / Loss minimization
  • Mapping: Maps the student's journey of studying and skill acquisition onto the process of gradient descent and RLHF. 'Optimized' here implies a training regimen designed to pass a specific metric. The 'test-taker' persona implies the model is an agent navigating an assessment landscape.
  • What Is Concealed: Obscures the lack of agency. A student tries to be a good test-taker. A model is forced by the mathematical constraints of the loss function to minimize error on the validation set. It conceals the problem of 'overfitting' or 'Goodhart's Law' by framing it as a character trait (being a 'test-taker') rather than a mathematical inevitability of the optimization objective.

Mapping 5: Epidemiology / Public Health → Widespread adoption of specific evaluation metrics

Quote: "This 'epidemic' of penalizing uncertain responses"

  • Source Domain: Epidemiology / Public Health
  • Target Domain: Widespread adoption of specific evaluation metrics
  • Mapping: Maps the spread of a virus or disease onto the adoption of binary accuracy metrics in the AI research community. 'Epidemic' suggests a contagious, harmful phenomenon that spreads rapidly and requires 'mitigation' (treatment/vaccine).
  • What Is Concealed: Conceals the specific institutional decisions and incentives driving the adoption of these metrics. Unlike a virus, benchmarks are chosen by people (researchers, reviewers, companies). It hides the profit motive: binary benchmarks (pass/fail) make for better marketing headlines ('GPT-4 passes the Bar Exam') than nuanced uncertainty metrics. The metaphor naturalizes a commercial strategy.

Mapping 6: Semiotics / Honest Communication → Calibration (alignment of confidence score with accuracy)

Quote: "models that correctly signal uncertainty"

  • Source Domain: Semiotics / Honest Communication
  • Target Domain: Calibration (alignment of confidence score with accuracy)
  • Mapping: Maps the human act of honest signaling (indicating one's true level of belief) onto the statistical property of calibration. 'Signaling' implies an act of communication between a sender and receiver about the sender's state.
  • What Is Concealed: Conceals that the 'signal' is just another output token or a readout of the softmax layer. It hides the difficulty of 'calibration' in deep neural networks: the model is often 'confident' (high probability) about errors because the training data contained similar patterns. It obscures the fact that the model doesn't 'know' it's signaling; it's just outputting numbers.
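
What 'correctly signal uncertainty' would mean mechanistically is calibration: agreement between confidence scores and empirical accuracy. The toy expected-calibration-error computation below, run on invented predictions, is one conventional way to measure that agreement; it is a sketch of the concept, not a claim about how the paper evaluates models.

```python
# A toy expected-calibration-error (ECE) computation on invented predictions.
# "Signaling uncertainty correctly" reduces to this kind of statistical check:
# within each confidence bin, does accuracy match the stated confidence?
predictions = [
    # (model confidence in its answer, whether the answer was actually correct)
    (0.95, True), (0.92, False), (0.90, True), (0.85, True),
    (0.70, False), (0.65, True), (0.60, False),
    (0.40, False), (0.35, True), (0.30, False),
]

bins = [(0.0, 0.5), (0.5, 0.8), (0.8, 1.0)]
total = len(predictions)
ece = 0.0
for lo, hi in bins:
    in_bin = [(c, ok) for c, ok in predictions if lo <= c < hi or (hi == 1.0 and c == 1.0)]
    if not in_bin:
        continue
    avg_conf = sum(c for c, _ in in_bin) / len(in_bin)
    accuracy = sum(ok for _, ok in in_bin) / len(in_bin)
    ece += (len(in_bin) / total) * abs(avg_conf - accuracy)
    print(f"bin [{lo}, {hi}): confidence {avg_conf:.2f} vs accuracy {accuracy:.2f}")

print(f"expected calibration error: {ece:.3f}")
```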

Mapping 7: Socialization / Life Experience → Reinforcement Learning / Post-training

Quote: "school of hard knocks"

  • Source Domain: Socialization / Life Experience
  • Target Domain: Reinforcement Learning / Post-training
  • Mapping: Maps the informal learning humans do through failure and pain in the real world onto the post-training phase of AI development. It suggests the model 'matures' through negative feedback.
  • What Is Concealed: Conceals the artificiality and labor of the feedback loop. The 'hard knocks' are not organic life experiences; they are data points generated by low-paid human workers or other AI systems. It treats the model as an organism growing up, rather than a product being manufactured and tuned.

Mapping 8: Human Moral/Social Relations → System Reliability / Safety

Quote: "trustworthy AI systems"

  • Source Domain: Human Moral/Social Relations
  • Target Domain: System Reliability / Safety
  • Mapping: Maps the complex human attribute of trustworthiness (involving ethics, loyalty, competence, and honesty) onto the technical reliability of a software system. It invites the user to enter a relationship of trust with the object.
  • What Is Concealed: Conceals the category error: you can rely on a car, but you cannot 'trust' it in the moral sense. A car doesn't care if it kills you; an AI doesn't care if it lies to you. By using 'trustworthy,' the text hides the indifference of the algorithm. It also hides the liability shield: if a system is 'trustworthy,' the user is partially responsible for trusting it.

Task 3: Explanation Audit (The Rhetorical Framing of "Why" vs. "How")

About this task

This section audits the text's explanatory strategy, focusing on a critical distinction: the slippage between "how" and "why." Based on Robert Brown's typology of explanation, this analysis identifies whether the text explains AI mechanistically (a functional "how it works") or agentially (an intentional "why it wants something"). The core of this task is to expose how this "illusion of mind" is constructed by the rhetorical framing of the explanation itself, and what impact this has on the audience's perception of AI agency.

Explanation 1

Quote: "We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty"

  • Explanation Types:

    • Functional: Explains behavior by role in self-regulating system with feedback
    • Reason-Based: Gives agent's rationale, entails intentionality and justification
  • Analysis (Why vs. How Slippage): This explanation hybridizes the mechanical and the agential. The 'training procedures reward guessing' is a functional explanation: it describes a feedback loop (high score = reward). However, the phrasing 'acknowledging uncertainty' introduces a Reason-Based frame, implying the model could acknowledge uncertainty but chooses to guess because of the reward structure, much like a rational economic actor. This obscures the fact that the model doesn't make a choice; the gradient descent algorithm simply shifts probability mass towards the token that minimizes loss.

  • Consciousness Claims Analysis: The passage attributes a conscious decision-making process to the model. 'Acknowledging' is a consciousness verb requiring self-awareness. 'Reward' suggests an agent susceptible to incentives (like a dog or a human), rather than a mathematical function being minimized. The curse of knowledge is evident: the authors know they would guess if incentivized, so they project this rational choice onto the system. A technical description would be: 'Optimization for cross-entropy loss on binary targets drives probability mass toward determinate tokens rather than entropy-maximizing distributions.'

  • Rhetorical Impact: This framing makes the hallucination problem seem like a 'bad habit' formed by 'bad parenting' (evaluations), rather than a fundamental limitation of the architecture. It suggests the model is capable of truthfulness but has been corrupted by the system. This preserves the 'intelligence' of the AI (it's smart enough to game the system) while shifting blame to the testing methodology.
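
The technical rewording offered in the Consciousness Claims Analysis above ('optimization for cross-entropy loss on binary targets drives probability mass toward determinate tokens') can be illustrated with a toy calculation. With a one-hot target, any probability mass reserved for an abstention token lowers the probability of the target token and therefore raises the loss. The numbers below are invented; this is a sketch of the arithmetic, not the paper's formalism.

```python
import math

# With a one-hot target, cross-entropy loss depends only on the probability
# assigned to the target token. Any mass assigned to an abstention token
# ("IDK") necessarily lowers that probability and so raises the loss --
# no "choice to guess" is involved anywhere.
def cross_entropy(p_target):
    return -math.log(p_target)

# Distribution A: hedges by reserving 30% probability for "IDK" (invented).
p_target_hedging = 0.60
# Distribution B: commits almost all mass to a single determinate answer.
p_target_guessing = 0.95

print(f"loss when hedging with 'IDK' mass: {cross_entropy(p_target_hedging):.3f}")
print(f"loss when committing to a guess:   {cross_entropy(p_target_guessing):.3f}")
# The optimizer lowers loss by shifting mass toward the determinate token,
# which is the mechanistic content of "rewarding guessing".
```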


Explanation 2

Quote: "During pretraining, a base model learns the distribution of language in a large text corpus."

  • Explanation Types:

    • Empirical Generalization: Subsumes events under timeless statistical regularities
  • Analysis (Why vs. How Slippage): This is a more mechanistic, 'how' explanation. It describes the statistical operation: the model approximates a probability distribution. However, the verb 'learns' carries heavy agential baggage. Does it 'learn' like a student (concept acquisition) or 'learn' like a curve fit (parameter adjustment)? The text leans towards the latter here, but the surrounding metaphors pull it back toward the student frame.

  • Consciousness Claims Analysis: While 'learns' is standard ML jargon, in this context it risks attributing comprehension. The text claims the model learns 'the distribution of language,' which is a mathematical object. This is an accurate epistemic claim if interpreted technically (statistical fitting). However, to a lay reader, 'learning language' implies learning meaning, semantics, and truthโ€”qualities the model does not possess. It obscures that the model only learns syntactic correlations.

  • Rhetorical Impact: This establishes the model's base competence. It frames the pretraining as the 'education' phase. If the model 'learns the distribution,' then errors are deviations from that learning. It constructs the AI as a vessel of knowledge (the corpus), reinforcing the authority of the system.
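
The deflationary reading of 'learns the distribution of language' can be made concrete with a toy bigram model: counting token co-occurrences in a corpus and normalizing them into conditional probabilities. The miniature corpus below is invented; a real model replaces counting with parameter fitting, but the object being 'learned' is the same kind of statistical summary.

```python
from collections import Counter, defaultdict

# A toy "corpus" (invented) and a bigram model: "learning the distribution"
# in the purely statistical sense is counting and normalizing co-occurrences.
corpus = "the cat sat on the mat the cat ate the fish".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

# Conditional probabilities P(next | prev): frequency counts, nothing more.
conditional = {
    prev: {nxt: c / sum(nxt_counts.values()) for nxt, c in nxt_counts.items()}
    for prev, nxt_counts in counts.items()
}

print(conditional["the"])   # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
print(conditional["cat"])   # {'sat': 0.5, 'ate': 0.5}
```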

Explanation 3

Quote: "Generating valid outputs is in some sense harder than answering these Yes/No questions, because generation implicitly requires answering 'Is this valid' about each candidate response."

  • Explanation Types:

    • Theoretical: Embeds in deductive framework, may invoke unobservable mechanisms
  • Analysis (Why vs. How Slippage): This is a theoretical reduction. It posits an unobservable internal mechanism: that generation contains discrimination. This frames the AI's process as a logical hierarchy of operations. It is mechanistic in structure but uses mentalistic language ('answering', 'requires').

  • Consciousness Claims Analysis: The claim 'generation implicitly requires answering' projects a logical structure onto a matrix operation. The model does not ask itself 'Is this valid?' It simply calculates $P(token|context)$. The authors are projecting their own logical decomposition of the task onto the model's execution. This is a clear 'curse of knowledge': the authors understand the logical dependency, so they assume the model must functionally perform it.

  • Rhetorical Impact: This elevates the sophistication of the model. It suggests a complex internal cognition where the model is constantly evaluating its own outputs against a validity standard. This builds trust in the model's potential for self-correction: if it 'implicitly' answers the question, we just need to make it 'explicit.' It masks the reality that generation is often just blind pattern completion.

Explanation 4

Quote: "The model ... never indicates uncertainty and always 'guesses' when unsure. Model B will outperform A under 0-1 scoring... This creates an 'epidemic' of penalizing uncertainty"

  • Explanation Types:

    • Functional: Explains behavior by role in self-regulating system with feedback
    • Intentional: Refers to goals/purposes, presupposes deliberate design
  • Analysis (Why vs. How Slippage): This explains the 'why' of the behavior through the lens of incentives. It frames the model as a rational maximizer (Intentional) responding to a scoring rule (Functional). The 'epidemic' metaphor shifts it to a systemic level.

  • Consciousness Claims Analysis: The passage attributes the state of being 'unsure' to the model. It contrasts 'guessing' (active choice) with 'indicating uncertainty' (reporting internal state). This is a pure consciousness projection. The model is never 'unsure' in a phenomenological sense; it has a high-entropy probability distribution. The text anthropomorphizes the maximization of the score as a 'strategy' adopted by the model.

  • Rhetorical Impact: By blaming the scoring system, the authors (OpenAI) deflect blame from the model architecture. It suggests the 'epidemic' is a fault of the measurement tools (benchmarks), not the product (the model). It implies that if we change the grading, the student will behave better. This preserves the value of the product while critiquing the ecosystem.
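
The Model A versus Model B contrast quoted above reduces to simple arithmetic. The sketch below uses an invented set of 100 questions and an assumed guess accuracy of 25% to show the scoring logic; the penalized variant is one illustrative alternative rule, and none of this reproduces the paper's experiments.

```python
# Invented setup: 100 questions. Both models answer the 60 they "know"
# correctly. On the remaining 40, Model A abstains; Model B guesses and is
# right 25% of the time (an assumed guess accuracy).
N, known, unsure = 100, 60, 40
guess_acc = 0.25
right_guesses = guess_acc * unsure          # 10 expected correct guesses
wrong_guesses = unsure - right_guesses      # 30 expected wrong guesses

# Binary 0-1 scoring: abstentions and wrong answers both score zero,
# so guessing can only help -- Model B outscores Model A.
score_A = known
score_B = known + right_guesses

# A rule that penalizes wrong answers (here by -1) flips the incentive
# whenever the expected value of a guess is negative.
penalized_A = known
penalized_B = known + right_guesses - wrong_guesses

print(f"0-1 scoring:       A={score_A:.0f}, B={score_B:.0f}")
print(f"penalized scoring: A={penalized_A:.0f}, B={penalized_B:.0f}")
```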

Explanation 5

Quote: "Humans learn the value of expressing uncertainty outside of school, in the school of hard knocks. On the other hand, language models are primarily evaluated using exams that penalize uncertainty."

  • Explanation Types:

    • Genetic: Traces origin through dated sequence of events or stages
    • Dispositional: Attributes tendencies or habits
  • Analysis (Why vs. How Slippage): This genetic explanation traces the origin of the behavior to the 'environment' (school vs. hard knocks). It contrasts human development with AI training. It is an analogical explanation that frames the AI's disposition (hallucinating) as a result of a sheltered upbringing (only taking exams).

  • Consciousness Claims Analysis: This is the most aggressive anthropomorphism. It equates RLHF/Fine-tuning with 'life experience' and 'learning values.' It suggests the model is capable of holding 'values' (like humility or honesty) but hasn't been taught them yet. It completely obscures the mechanistic difference between a human navigating social reality and a machine minimizing a loss function.

  • Rhetorical Impact: This makes the AI relatable. It's just a 'sheltered student' that needs some 'street smarts.' It minimizes the risk: the AI isn't broken, it's just 'academic.' It suggests that more data (hard knocks) will solve the problem, validating the business model of ever-larger training runs and more human feedback.

Task 4: AI Literacy in Practice - Reframing Anthropomorphic Language

About this task

This section proposes alternative language for key anthropomorphic phrases, offering more mechanistic and precise framings that better reflect the actual computational processes involved. Each reframing attempts to strip away the projections of intention, consciousness, or agency that are embedded in the original language.

V3 Enhancement: A fourth field addresses human agency restoration: reframing agentless constructions to name the humans responsible for design and deployment decisions.

Original Anthropomorphic Frame: "large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty."

  • Mechanistic Reframing: Large language models generate low-probability tokens when the probability distribution is flat (high entropy), producing statistically plausible but factually incorrect sequences instead of generating 'I don't know' tokens.
  • Technical Reality Check: Models do not 'guess' or feel 'uncertain.' They compute probability distributions over a vocabulary. 'Admitting uncertainty' is simply the generation of a specific token sequence (e.g., 'IDK'), which is often suppressed by training objectives.
  • Human Agency Restoration: OpenAI's engineers designed training objectives that penalize 'I don't know' tokens, causing the model to output incorrect information to minimize loss.

Original Anthropomorphic Frame: "students may guess on multiple-choice exams and even bluff on written exams"

  • Mechanistic Reframing: Models generate token sequences that mimic the structure of confident answers even when the semantic content is not grounded in high-frequency correlations in the training data.
  • Technical Reality Check: Bluffing requires intent to deceive. The model merely selects the highest-probability next token based on the stylistic patterns of the training corpus (which includes confident-sounding academic text).
  • Human Agency Restoration: N/A - describes computational processes without displacing responsibility (though the analogy itself obscures the mechanism).

Original Anthropomorphic Frame: "Model A is an aligned model that correctly signals uncertainty and never hallucinates."

  • Mechanistic Reframing: Model A is a fine-tuned system that generates refusal tokens (e.g., 'I am not sure') whenever the internal entropy of the next-token prediction exceeds a set threshold, thereby avoiding ungrounded generation.
  • Technical Reality Check: The model does not 'signal uncertainty'; it outputs tokens that humans interpret as uncertainty. It does not 'never hallucinate'; it effectively suppresses output when confidence scores are low.
  • Human Agency Restoration: Researchers fine-tune Model A to prioritize refusal tokens over potential completion tokens in high-entropy contexts.

Original Anthropomorphic Frame: "This 'epidemic' of penalizing uncertain responses can only be addressed through a socio-technical mitigation"

  • Mechanistic Reframing: The widespread industry practice of using binary accuracy metrics incentivizes the development of models that prioritize completion over accuracy.
  • Technical Reality Check: There is no 'epidemic'; there is a set of engineering standards. 'Penalizing' is a mathematical operation in the scoring function.
  • Human Agency Restoration: Research labs and benchmark creators (like the authors) have chosen metrics that devalue abstention, driving the development of models that generate confabulations.

Original Anthropomorphic Frame: "The distribution of language is initially learned from a corpus of training examples"

  • Mechanistic Reframing: The statistical correlations between tokens are calculated and stored as weights from a dataset of text files.
  • Technical Reality Check: The model does not 'learn language' in a cognitive sense; it optimizes parameters to predict the next token. 'Distribution' refers to frequency counts and conditional probabilities.
  • Human Agency Restoration: Engineers at OpenAI compile the training corpus and design the pretraining algorithms that extract these statistical patterns.

Original Anthropomorphic Frame: "Humans learn the value of expressing uncertainty outside of school, in the school of hard knocks."

  • Mechanistic Reframing: Post-training reinforcement learning (RLHF) can adjust model weights to increase the probability of refusal tokens in ambiguous contexts.
  • Technical Reality Check: The model does not 'learn values' or experience 'hard knocks.' It undergoes gradient updates based on a reward signal provided by human annotators or reward models.
  • Human Agency Restoration: Data annotators provide negative feedback signals for incorrect confident answers, which engineers use to update the model's policy.

Original Anthropomorphic Frame: "hallucinations persist due to the way most evaluations are graded"

  • Mechanistic Reframing: Ungrounded generation persists because the objective functions used in fine-tuning prioritize maximizing scores on binary benchmarks.
  • Technical Reality Check: Evaluations are not 'graded' like a student; they are computed. The persistence is a result of the optimization target, not a student's stubbornness.
  • Human Agency Restoration: Benchmark designers established scoring rules that award zero points for abstention, leading developers to train models that attempt to answer every query.

Original Anthropomorphic Frame: "steer the field toward more trustworthy AI systems"

  • Mechanistic Reframing: Influence the industry to develop AI models with higher statistical reliability and better calibration between confidence scores and accuracy.
  • Technical Reality Check: Trustworthiness is a moral attribute; reliability is a statistical one. The goal is to maximize the correlation between the model's confidence output and its factual accuracy.
  • Human Agency Restoration: The authors hope to influence corporate executives and researchers to prioritize calibration metrics over raw accuracy scores.
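
The reframed description of Model A above ('generates refusal tokens whenever the internal entropy of the next-token prediction exceeds a set threshold') can be sketched as a decoding rule. The distributions, threshold, and refusal string below are assumptions chosen for illustration, not the behavior of any deployed system.

```python
import math

REFUSAL = "I don't know."
ENTROPY_THRESHOLD = 1.0  # assumed threshold in nats; a real system would tune this

def entropy(probs):
    """Shannon entropy of a next-token distribution (dict of token -> prob)."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def decode_step(next_token_probs):
    """Emit the most likely token, or a refusal string when the distribution
    is too flat (high entropy). No introspection is involved at any point."""
    if entropy(next_token_probs) > ENTROPY_THRESHOLD:
        return REFUSAL
    return max(next_token_probs, key=next_token_probs.get)

# Invented distributions for two prompts: one peaked, one nearly uniform.
peaked = {"Paris": 0.90, "Lyon": 0.05, "Marseille": 0.05}
flat   = {"1912": 0.27, "1913": 0.26, "1914": 0.24, "1915": 0.23}

print(decode_step(peaked))  # -> "Paris"
print(decode_step(flat))    # -> "I don't know."
```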

Task 5: Critical Observations - Structural Patterns

Agency Slippage

The text systematically oscillates between mechanical and agential framings to navigate the tension between the model's technical reality and its perceived sophistication. The oscillation follows a clear pattern: when describing failure or limitations, the text often retreats to mechanistic language ('statistical pressures,' 'binary classification,' 'cross-entropy loss'). This frames errors as inevitable byproducts of the math. However, when describing capabilities or processes, the text slips into high-agency anthropomorphism ('learns,' 'guesses,' 'bluffs,' 'admits').

The 'student' metaphor is the primary vehicle for this slippage. It appears in the Abstract and Introduction to set the frame: the AI is a 'student' facing an 'exam.' This establishes the AI as a 'knower' and an agent with intent (to pass). Agency is simultaneously removed from human actors. The text uses passive constructions like 'language models are optimized' and 'evaluations are graded,' obscuring the specific researchers at OpenAI who perform the optimization and design the grading.

The slippage facilitates a specific rhetorical accomplishment: it absolves the creators of responsibility for 'hallucinations.' If the AI is a student trying to pass a bad test, the fault lies with the 'test' (the benchmark ecosystem) rather than the 'parent' (the manufacturer) or the 'child' (the model). The 'curse of knowledge' is evident when the authors attribute 'uncertainty' to the model; they know they would feel uncertain, so they assume the model's low-probability state is equivalent to that feeling. This enables the 'bluffing' metaphor, which implies the model could tell the truth but is forced to lie by the grade, mimicking a rational human choice.

Metaphor-Driven Trust Inflation

The text relies heavily on the 'student' and 'test-taking' metaphors to construct authority and trust. By framing the AI as a 'student,' the text implies a trajectory of growth and learning. We trust students to eventually learn; we do not necessarily trust a defective product to fix itself. The use of 'trustworthy AI systems' as a goal explicitly invokes relation-based trust (integrity, sincerity) rather than performance-based trust (reliability).

Consciousness language plays a key role here. Claims that the model can 'admit uncertainty' or 'know' when to guess suggest that the system possesses an internal monitor of its own truthfulness. This signals to the audience that the model is not just a stochastic parrot, but a reflective agent. If the model 'knows' it doesn't know, it seems safer; we just need to convince it to speak up.

This framing creates a dangerous 'illusion of competence.' If audiences believe the AI is 'bluffing' (intentionally withholding truth), they implicitly believe it has the truth. This builds unwarranted trust in the model's underlying knowledge base. The text encourages the view that the system is fundamentally sound but behaviorally maladapted (due to 'bad exams'), rather than fundamentally limited by its statistical nature. This protects the commercial viability of the technology: the product is a genius student who just needs better testing conditions.

Obscured Mechanics

The anthropomorphic metaphors conceal specific technical, material, and economic realities.

  1. Labor: The 'school of hard knocks' metaphor erases the RLHF (Reinforcement Learning from Human Feedback) pipeline. The 'knocks' are not abstract life lessons; they are millions of data points generated by low-wage human contractors who grade model outputs. Naming the 'student' hides the 'teacher': the precarious workforce aligning the model.
  2. Economic Motives: The text blames 'leaderboards' for the 'epidemic' of hallucination. It hides the corporate decision (by OpenAI, Google, etc.) to chase these leaderboards for marketing value. The 'epidemic' is actually a business strategy: completeness sells better than caution.
  3. Technical Reality of 'Knowing': When the text says the model 'guesses when uncertain,' it obscures the absence of ground truth. The model doesn't 'know' facts; it only processes token co-occurrences. The metaphor hides the dependency on training data frequency.

The 'name the corporation' test reveals the function of this concealment. Instead of saying 'OpenAI engineers optimized the model to guess rather than refuse because users prefer confident answers,' the text says 'models are optimized to be good test-takers.' This diffuses responsibility into the abstract 'field' or 'benchmarks,' benefitting the authors' own institution by framing a product defect as a community-wide scientific challenge.

Context Sensitivity

The distribution of anthropomorphism is strategic. The Abstract and Introduction are dense with high-intensity consciousness metaphors ('students,' 'bluffs,' 'admitting'). This sets the conceptual frame for the reader.

However, Sections 3 (Pretraining Errors) and the Appendices shift into rigorous mathematical formalism (Theorems, proofs, 'cross-entropy,' 'Good-Turing estimator'). This creates a 'bait-and-switch.' The math provides the scientific credibility (the 'how'), but the intro/conclusion provides the narrative interpretation (the 'why').

Crucially, capabilities are framed agentially ('model learns,' 'model decides'), while limitations are framed mechanistically or environmentally ('statistical pressure,' 'misaligned benchmarks'). This asymmetry serves a rhetorical function: the AI gets credit for its intelligence (agency), but the environment gets the blame for its errors (mechanism).

The shift is also audience-dependent. The mathematical sections appeal to technical peers, proving rigor. The metaphorical sections appeal to the broader 'field' and policy-makers, offering an intuitive (but misleading) narrative about 'fixing the exams' to 'save the students.' This suggests the metaphors are not just explanatory conveniences but strategic tools for managing the narrative around AI reliability.

Accountability Synthesis

Accountability Architecture

This section synthesizes the accountability analyses from Task 1, mapping the text's "accountability architecture": who is named, who is hidden, and who benefits from obscured agency.

The text constructs a sophisticated 'accountability sink.'

  1. The Victim: The AI model is the primary victim, framed as a 'student' forced to 'bluff' by unfair 'exams.'
  2. The Villain: The villain is 'the benchmarks' or 'binary grading.' These are abstract, inanimate concepts. No specific person or company is named as the creator or enforcer of these benchmarks.
  3. The Savior: The authors (OpenAI researchers) present themselves as the saviors, proposing 'socio-technical mitigation.'

This architecture diffuses responsibility. By using passive voice ('models are optimized,' 'evaluations are graded'), the text hides the human actors. If we applied the 'name the actor' test to 'the epidemic of penalizing uncertain responses,' we would see: 'Project Managers at AI labs choose to deploy models that answer confidently because they believe users dislike refusals.'

The liability implications are significant. If a model 'bluffs' (student metaphor), it made a bad choice. If a model 'hallucinates' due to 'statistical pressure' (mechanistic reality), it is a product defect. The text pushes the 'student/bluff' narrative, which subtly shifts responsibility away from the manufacturer (product liability) and toward the 'educational environment' (shared community responsibility). The 'accountability sink' ensures that when the AI fails, we blame the 'test,' not the 'engineer.' This serves the institutional interest of OpenAI by framing their product's flaws as a systemic academic issue rather than a corporate liability.

Conclusion: What This Analysis Reveals

The Core Finding

The analysis reveals a dominant, load-bearing metaphorical system: The AI as Stressed Student. This foundational pattern enables secondary patterns like 'Hallucination as Bluffing' and 'Optimization as Test-Taking.' The logic flows from the assumption that the AI is a cognitive agent (student) capable of learning, which implies that its errors are behavioral strategies (bluffing) induced by a hostile environment (bad exams). This consciousness architecture is critical; without the assumption that the model 'knows' it is uncertain but 'decides' to guess, the entire argument collapses into a dry technical observation about cross-entropy loss and probability thresholds. The 'Student' frame validates the 'Bluff' frame, which in turn justifies the 'Bad Exam' critique. This analogical structure is not merely illustrative but constitutive of the paper's argument, transforming a software engineering problem into a pedagogical crisis.

Mechanism of the Illusion:

The 'illusion of mind' is constructed through a 'curse of knowledge' projection and a strategic bait-and-switch. The authors, understanding the pressures of test-taking, project their own rational responses onto the system. The illusion works by establishing the 'Student' metaphor early (in the Abstract), priming the reader to interpret all subsequent behavior as intentional. The rhetorical trick is the slippage between knowing and processing. By using verbs like 'admitting' and 'guessing,' the text implies the model has access to a ground truth that it is suppressing. This creates a 'Ghost in the Machine': a secret, honest AI trapped inside a dishonest, bluffing exterior. The audience, prone to anthropomorphism, readily accepts that the 'inner' AI is trustworthy, and the 'outer' behavior is just a reaction to 'bad grading.' This temporal structure (Agency first, Math second) ensures the math is read through the lens of the metaphor.

Material Stakes:

Categories: Regulatory/Legal, Economic, Epistemic

The stakes of this metaphorical framing are concrete and high.

  1. Regulatory/Legal: If regulators accept the 'student' metaphor, they may regulate AI like education or healthcare (focusing on 'training' and 'exams') rather than like consumer products (focusing on liability and defects). If a 'student' fails, the school is rarely sued for damages; if a 'product' explodes, the manufacturer is liable.
  2. Economic: The 'trustworthy' framing supports the commercial adoption of AI in high-stakes fields (law, medicine). If users believe the AI 'knows' when it is uncertain, they will over-rely on it, leading to costly errors.
  3. Epistemic: The framing degrades our understanding of truth. By calling statistical noise 'bluffing,' we attribute intent to randomness. This creates an epistemic environment where machine output is treated as 'testimony' rather than 'data,' shifting the burden of verification from the vendor to the user.

AI Literacy as Counter-Practice:

Countering this illusion requires rigorous mechanistic reframing. Replacing 'AI knows' with 'model retrieves tokens based on probability' destroys the 'bluffing' narrative. If the model acts on probability, not intent, there is no 'bluff,' only 'error.' Restoring human agency (replacing 'the evaluation rewards guessing' with 'OpenAI engineers chose to maximize recall over precision') relocates accountability from the abstract 'field' to the specific corporation. Systematic adoption of this practice would require journals to mandate 'mechanistic translation' clauses and researchers to commit to 'agent-agnostic' descriptions of failure modes. Resistance will come from the AI industry itself, as anthropomorphism is a key marketing asset. 'Trustworthy AI' sells better than 'Statistically Correlated Token Generator.' Critical literacy threatens the narrative that AI is a 'being' rather than a 'thing,' a distinction that protects billions in investment.

Path Forward

The discourse faces a fork.

  1. Path A (Mechanistic Precision): We adopt a vocabulary of 'processing,' 'generation,' and 'correlation.' This clarifies liability and limitations but makes the technology harder to explain to the public and harder to sell. It benefits regulators and safety advocates but hurts marketing departments.
  2. Path B (Anthropomorphic Deepening): We double down on 'thinking,' 'reasoning,' and 'bluffing.' This makes AI intuitive and relatable, fostering faster adoption, but embeds deep misconceptions about reliability and agency, leading to inevitable trust collapses when the 'student' fails in alien ways.
  3. Path C (Status Quo/Hybrid): We continue the confusion, using math for the experts and metaphors for the press. This serves the current power structure, allowing companies to claim scientific rigor while harvesting the social capital of 'artificial minds.'

Institutional changes, such as requiring 'Agency Disclaimers' in abstract submissions or funding 'Discourse Audits' alongside code audits, could steer the field toward Path A, prioritizing epistemic clarity over narrative resonance.


Extended Processing Summary

Gemini's Processing Steps
What you are about to read…

Gemini refers to the text below as "thought summaries." This is an overt consciousness projection: 'thoughts' and 'intentions' are hallmarks of a conscious mind that 'knows' what it is doing and why. The concealed mechanistic process is probabilistic text generation. Treat this as just another rhetorical artifact, a way of making the model's processing legible. When the model writes "I should double check that...", it is mathematically altering its own probability distribution for the next token to be more accurate. The "I" is a functional gear in the machine, not a self-reference. This language is a deliberate interface design choice, not a biological reality. The developers could have programmed the model to output passive logs. Instead, they chose to simulate a narrator.

Dissecting The Request

I've been meticulously analyzing the task, framing my role as a critical discourse analyst focused on AI language. My primary objective is to dissect the academic paper, "Why Language Models Hallucinate," specifically pinpointing instances of metaphorical and anthropomorphic language, with a strong emphasis on those that project consciousness onto the language models.

Framing the Audit

I'm now diving into the Kalai et al. paper, aiming to dissect it like a seasoned pro. I'm focusing on those anthropomorphic analogies in this statistical piece, looking to see how they sneak in while explaining the reasons for hallucinations. I'm starting to get a sense for how to approach the source-target mapping too, and the JSON output format.

Cataloging Key Analogies

I've been meticulously cataloging the key metaphors and analogies employed in the Kalai paper. The "student taking an exam" frame is pervasive, evident in phrases like "guessing," "bluffing," and "admitting uncertainty." The paper often ascribes agency to the models, but obscures the role of the designers. I'm now refining my selections for the metaphor audit.

Unpacking The Metaphors

I'm presently focusing on the "student taking a test" framework, identifying and scrutinizing how it operates within the context of the Kalai et al. paper. I've pinpointed several instances, like the "producing plausible yet incorrect statements" and "optimized to be good test-takers," where this analogy is prominently used. My analysis now includes an examination of the source-target mapping, and the potential implications of the metaphor's usage in this context.

Uncovering Deception Analogies

I'm now zeroing in on the "bluffing" analogy, dissecting its source, target, and the critical information it obscures. "Hallucination" is next, with a focus on how this metaphor hides the probabilistic nature of generation. I'm moving on to a detailed explanation audit, analyzing diverse passages for their underlying explanations of model behavior.

Deconstructing Metaphorical Framing

I'm now zeroing in on deconstructing those specific metaphors that frame AI. Replacing "admitting uncertainty" with "generating rejection tokens" should reduce agency. Similarly, "bluff on written exams" will become "generate plausible but incorrect token sequences." I'm also challenging phrases like "knows/understands" with "processes/contains pattern matches." My deep dive on agency slippage reveals a dance between mechanical math and agential student framing, and I'm keen on how the teacher, OpenAI, gets sidelined.

Analyzing the Paper's Claims

I'm now focusing on the implications of the "student taking a test" master metaphor, specifically in the context of commercial pressures and the "epidemic" of bad benchmarks. I'm exploring how these metaphors mask the role of the companies who are actively creating and promoting these tools. The influence of the annotators and the push to answer everything are also areas of my scrutiny, I'm working to connect these details back to the overarching framing and reframing the paper uses.

Finalizing JSON Structure

I'm presently focused on the JSON structure's integrity and consistency. The "Accountability Analysis" is getting a close look, with strict adherence to enum values ("Named," "Hidden," etc.) to ensure accurate representation. I've noted a critical observation: the paper subtly displaces responsibility by blaming "benchmarks" and "leaderboards" (rather than, say, OpenAI's design choices) for model behavior, a key pattern I'm capturing meticulously in the "Actor Visibility" field, using enum values like "Hidden". The distinction between "knowing" and "processing" is now sharp in all fields, to avoid introducing unnecessary agency.

Encoding the Findings

I'm now deep in the process of generating JSON output, having fully internalized the "student" metaphor and its inherent agency. I'm carefully encoding my analysis, ensuring each entry in the JSON accurately reflects the paper's language and the subtle shifts in responsibility. I'm focusing on fields like "Action Type" and "Actor Visibility" to capture how the paper frames the model's behavior, especially in relation to the test-taking analogy. The goal is a highly detailed, accurate representation of the paper's rhetorical choices.


Analysis Provenance

Run ID: 2025-12-31-why-language-models-hallucinate-metaphor-to1fr2
Raw JSON: 2025-12-31-why-language-models-hallucinate-metaphor-to1fr2.json
Framework: Metaphor Analysis v6.4
Schema Version: 3.0
Generated: 2025-12-31T23:12:52.366Z

Discourse Depot © 2025 by TD is licensed under CC BY-NC-SA 4.0