
Why language models hallucinate

Task 1: Metaphor and Anthropomorphism Audit

Here are 12 distinct instances of metaphorical and anthropomorphic language identified in the text.

1. Cognition as a Pathological Process

  • Quote: "...one challenge remains stubbornly hard to fully solve: hallucinations."
  • Frame: Model Error as Mental Illness.
  • Projection: A human psychological/neurological pathology (perceiving something that is not present) is mapped onto the AI system's generation of factually incorrect information.
  • Acknowledgment: Unacknowledged. "Hallucination" is presented as a standard technical term.
  • Implications: This dramatizes the error, framing it as a mysterious, almost biological glitch in a mind-like entity rather than a predictable output of a statistical system. It can create fear and misunderstanding, obscuring the mathematical cause of the error.

2. Statistical Output as Human Confidence

  • Quote: "...a model confidently generates an answer that isn’t true."
  • Frame: Model as a Confident Speaker.
  • Projection: The human cognitive and emotional state of "confidence" is mapped onto a model's generation of declarative statements without hedging language (e.g., "perhaps," "it's possible").
  • Acknowledgment: Unacknowledged. Presented as a direct description of the model's state.
  • Implications: This creates the illusion that the model has an internal state of belief or certainty. It encourages users to trust outputs that sound confident and feel betrayed when they are wrong, misattributing the error to deceit rather than statistical processes.

3. Model Training as Didactic Instruction

  • Quote: "Teaching to the test"
  • Frame: Model as a Student.
  • Projection: The human educational process of a student learning to pass an exam is mapped onto the process of optimizing a model's parameters against a specific evaluation dataset.
  • Acknowledgment: Presented as an analogy ("Think about it like...").
  • Implications: This makes the complex process of model optimization highly relatable but misleadingly simplifies it. It implies the model "learns" concepts like a student, rather than adjusting weights to minimize a loss function, which can create unrealistic expectations about its ability to generalize knowledge.

4. Parameter Optimization as Intentional Deception

  • Quote: "...they are encouraged to guess rather than say 'I don’t know.'"
  • Frame: Model as a Strategic Test-Taker.
  • Projection: The human act of "guessing"—a conscious choice made under uncertainty to maximize a score—is mapped onto the model's tendency to generate its highest-probability token sequence even when that probability is low.
  • Acknowledgment: Unacknowledged within the phrase, though part of the broader "student" analogy.
  • Implications: This attributes motive and strategy to the model. It suggests the model "chooses" to guess, framing the problem as one of correcting bad habits rather than re-engineering a reward function.

5. Model Output as Character Virtue

  • Quote: "Abstaining is part of humility, one of OpenAI’s core values."
  • Frame: Model as a Virtuous Agent.
  • Projection: The human moral and personal quality of "humility" (modesty, low self-regard) is mapped onto a model's programmed ability to output "I don't know."
  • Acknowledgment: Unacknowledged; presented as a direct characterization.
  • Implications: This is a powerful framing that aligns the machine's behavior with a human ethical framework. It builds trust by suggesting the AI can possess virtues. This obscures the fact that "humility" is a pre-programmed response triggered by a confidence score falling below a threshold, not an ethical choice.

6. Model Calibration as Self-Awareness

  • Quote: "...a careful model that admits uncertainty."
  • Frame: Model as a Self-Reflective Individual.
  • Projection: The human cognitive acts of being "careful" and "admitting" a state of mind (uncertainty) are mapped onto the model's function of outputting a specific phrase when its internal probability calculations are inconclusive.
  • Acknowledgment: Unacknowledged.
  • Implications: This implies the model possesses introspection and self-awareness. It can lead users to believe the model "knows what it doesn't know" in a human sense, which overstates its cognitive capabilities and may lead to misplaced trust in its expressions of certainty.

7. Model Behavior as Learned Strategy

  • Quote: "...models will keep learning to guess."
  • Frame: Model as a Habit-Forming Agent.
  • Projection: The human process of "learning" a behavior or strategy through repeated experience and reinforcement is mapped onto the parameter adjustments that occur during model training.
  • Acknowledgment: Unacknowledged.
  • Implications: This frames the model as an autonomous learner that develops "bad habits." The policy implication is that we need to "re-train" its behavior; this is true in a technical sense, but the metaphor implies correcting a sentient agent's choices.

8. Model Knowledge as Linguistic Familiarity

  • Quote: "...a small model which knows no Māori can simply say 'I don’t know' whereas a model that knows some Māori has to determine its confidence."
  • Frame: Model as a Knowledgeable Person.
  • Projection: The human state of "knowing" a language or subject is mapped onto the presence of specific data patterns within the model's weights.
  • Acknowledgment: Unacknowledged.
  • Implications: This language constructs a clear illusion of mind. It suggests the model possesses knowledge in the same way a person does. The reality is that the model contains statistical correlations related to Māori language tokens, not a conceptual understanding of Māori itself.

9. Model Limitation as Self-Knowledge

  • Quote: "It can be easier for a small model to know its limits."
  • Frame: Model as a Self-Aware Being.
  • Projection: The Socratic virtue of "knowing your limits" (a form of meta-cognition) is mapped onto a smaller model having fewer and less complex data patterns, making it statistically more likely to default to an "I don't know" state.
  • Acknowledgment: Unacknowledged.
  • Implications: This strongly anthropomorphizes the model, attributing meta-cognition and self-assessment. It obscures the mechanistic reality: a smaller model isn't "wiser"; it simply has a less complex statistical map of the world, making its failure modes more direct.

10. Model Outputs as Honest Confessions

  • Quote: "...encourages guessing rather than honesty about uncertainty."
  • Frame: Model as a Moral Communicator.
  • Projection: The human virtue of "honesty" is mapped onto a model's calibrated output of an uncertainty phrase.
  • Acknowledgment: Unacknowledged.
  • Implications: This frames the problem of hallucinations in moral terms. An "honest" model is good, a "guessing" model is bad. This directs policy and public perception toward judging AI behavior on a human moral scale, diverting attention from the technical challenge of statistical calibration.

11. Optimization Process as Personal Effort

  • Quote: "we are working hard to further reduce them."
  • Frame: Corporate Actor as a Diligent Individual.
  • Projection: This projects human effort ("working hard") onto the corporate entity (OpenAI), which in turn is often conflated with the AI itself.
  • Acknowledgment: Unacknowledged.
  • Implications: While describing the company, not the model, this language personifies the development process. It frames AI safety as a matter of diligent effort and goodwill, potentially downplaying the inherent, structural challenges of the technology.

12. Model Errors as Mysterious Phenomena

  • Quote: "Hallucinations are a mysterious glitch in modern language models." (This is a claim the text is refuting, but its inclusion highlights its prevalence in discourse).
  • Frame: Model Error as a Paranormal Event.
  • Projection: The human concept of a "mysterious glitch" or ghost in the machine is mapped onto statistically predictable errors.
  • Acknowledgment: Acknowledged as a "common misconception."
  • Implications: By setting this up as a strawman, the text positions its own explanation as a demystification. However, its own persistent use of anthropomorphic language (like "hallucination" itself) ironically perpetuates the very sense of mystery it claims to dispel.

Task 2: Source-Target Mapping Analysis

1. Quote: "...one challenge remains stubbornly hard to fully solve: hallucinations."

  • Source Domain: Human Psychopathology.
  • Target Domain: AI Model's Factual Errors.
  • Mapping: The relational structure of a human mind perceiving non-existent stimuli is projected onto a model generating factually incorrect text. The key mapped element is the discrepancy between output and reality.
  • Conceals: This mapping conceals the underlying mechanism. Human hallucinations arise from complex neurochemical processes or psychological states. The model's "hallucination" is a predictable outcome of a probabilistic algorithm generating a token sequence that does not correlate with established facts. It hides the mathematical, non-sentient origin of the error.

2. Quote: "...a model confidently generates an answer that isn’t true."

  • Source Domain: Human Communication and Emotion.
  • Target Domain: AI Model's Output Style.
  • Mapping: The relationship between a human's internal feeling of certainty and their declarative, unqualified speech style is projected onto the model. The model's high-probability output is read as if it were produced by the feeling of "confidence" that, in humans, gives rise to unqualified speech.
  • Conceals: It conceals that there is no internal feeling or belief state. "Confidence" is a post-hoc label we apply to an output that is merely the result of a mathematical function maximizing a probability score. The model feels nothing.

3. Quote: "Teaching to the test"

  • Source Domain: Human Education System.
  • Target Domain: Model Optimization and Evaluation.
  • Mapping: A teacher instructs a student -> A developer trains a model. A student memorizes answers for a test -> A model's parameters are fine-tuned on an evaluation dataset. The student gets a grade -> The model gets an accuracy score.
  • Conceals: This conceals the profound difference between human learning (conceptual understanding, abstraction, generalization) and model optimization (mathematical weight adjustment to minimize a loss function). The metaphor implies understanding where there is only pattern-matching.

4. Quote: "...they are encouraged to guess..."

  • Source Domain: Human Decision-Making Under Pressure.
  • Target Domain: Model's Output Generation Under Uncertainty.
  • Mapping: A person facing a question with an incentive to answer (points) and no knowledge makes a conscious "guess." This structure of incentive + uncertainty -> strategic action is mapped onto the model's reward function prioritizing any answer over a null response.
  • Conceals: It conceals the lack of consciousness or strategy. The model does not "decide" to guess. Its architecture and reward function make generating some sequence the path of least resistance (lowest loss) in the optimization landscape.

5. Quote: "Abstaining is part of humility..."

  • Source Domain: Human Virtue Ethics.
  • Target Domain: A Pre-programmed Model Response.
  • Mapping: The relationship between a person's character (humility) and their behavior (admitting they don't know) is projected onto the model. The model's output ("I don't know") is mapped to the virtue of humility.
  • Conceals: It conceals that the model's action is not rooted in character. It is a deterministic or stochastic output triggered by an internal confidence metric falling below a developer-set threshold. It is a mechanical process, not a moral one.

6. Quote: "...a careful model that admits uncertainty."

  • Source Domain: Human Meta-Cognition.
  • Target Domain: Calibrated Model Output.
  • Mapping: The process of a person introspecting, recognizing their own uncertainty, and then verbalizing it is mapped onto the model's computational process. The internal confidence score is mapped onto the feeling of uncertainty, and the "I don't know" output is mapped onto the act of "admitting."
  • Conceals: It conceals the absence of introspection. The model is not aware of its uncertainty. It is a system executing a function: IF confidence_score < X, THEN output_uncertainty_phrase.
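
A minimal sketch can make the mechanical nature of this "admission" concrete. The following Python fragment is illustrative only, assuming a scalar confidence score and a developer-set threshold; the function name, the 0.7 value, and the refusal string are hypothetical, not OpenAI's implementation.

```python
# Illustrative sketch: abstention as a threshold check, not introspection.
# The function name, threshold value, and refusal string are hypothetical.
def respond(candidate_text: str, confidence_score: float, threshold: float = 0.7) -> str:
    """Return the generated text only if its confidence score clears a fixed threshold."""
    if confidence_score < threshold:
        return "I don't know."  # pre-defined refusal string, not a felt state
    return candidate_text
```

Whatever sentence the user sees, it is the result of this comparison, not of the model "recognizing" its own uncertainty.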

7. Quote: "...models will keep learning to guess."

  • Source Domain: Behavioral Psychology (Habit Formation).
  • Target Domain: Reinforcement Learning / Model Fine-tuning.
  • Mapping: The reinforcement loop in animal/human learning (behavior -> reward -> strengthened behavior) is mapped onto the model's training loop (output -> high score on metric -> weights adjusted to favor that output type).
  • Conceals: The "learning" is purely mathematical optimization. The metaphor implies an emergent, autonomous adaptation, when it is a direct and engineered consequence of the chosen loss function. The model doesn't develop a "habit"; its parameters are simply shaped by the incentives provided during training.

8. Quote: "...a model that knows some Māori has to determine its confidence."

  • Source Domain: Human Epistemology and Linguistics.
  • Target Domain: Data Representation in Model Weights.
  • Mapping: A person's brain containing learned knowledge of a language is mapped onto the model's weights containing statistical patterns from Māori text data. The human act of "determining confidence" (a judgment call) is mapped onto the model calculating a probability score for its next token.
  • Conceals: It conceals the difference between knowing and processing. The model doesn't "know" Māori. It has a complex statistical model of token co-occurrence from Māori text. It lacks semantics, intent, and cultural context.

9. Quote: "It can be easier for a small model to know its limits."

  • Source Domain: Socratic Philosophy / Self-Awareness.
  • Target Domain: Simplicity of a Model's Parameter Space.
  • Mapping: The relationship between human wisdom and recognizing one's own ignorance is mapped onto a small model's behavior. The model's limited capacity is equated with a form of self-knowledge.
  • Conceals: This is highly misleading. The small model doesn't "know" its limits. Its limited training data and fewer parameters mean it has no statistically plausible path to generating a complex, incorrect answer, so it defaults to a refusal. It's a failure of capability framed as a success of wisdom.

10. Quote: "...honesty about uncertainty."

  • Source Domain: Human Morality and Communication.
  • Target Domain: Model's Calibrated Output.
  • Mapping: The alignment between a person's internal state (uncertainty) and their external statement ("I'm not sure") is defined as honesty. This is mapped onto the alignment between a model's internal confidence score and its generated text.
  • Conceals: The moral dimension. "Honesty" implies a choice to resist the temptation to deceive. The model makes no such choice; it either reflects its probability calculations in its output or it does not, depending on how it was built. The "dishonesty" of guessing is an engineered outcome, not a moral failing.

Task 3: Explanation Audit (The Rhetorical Framing of "Why" vs. "How")

1. Quote: "Our new research paper argues that language models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertainty."

  • Explanation Types: Functional ("Describes purpose within a system" - explains how the reward system functions) and Dispositional ("Attributes tendencies or habits" - it's disposed to "guess").
  • Analysis (Why vs. How): This is a clear slippage from how to why. The "how" is the functional explanation: the evaluation metric gives a higher expected score to a guess, which is occasionally lucky, than to an abstention. The "why" is the agential framing: the system is "rewarded for guessing," which implies a motivation. The dispositional term "guessing" frames the statistical output as a behavioral tendency.
  • Rhetorical Impact: It frames the AI as a rational (but perhaps lazy) agent responding to incentives, much like a student. This makes the problem seem understandable and solvable through better "incentives," while obscuring the sheer technical complexity of calibrating a massive probabilistic system.

2. Quote: "Hallucinations persist partly because current evaluation methods set the wrong incentives."

  • Explanation Types: Functional. "Describes purpose within a system."
  • Analysis (Why vs. How): This leans towards how, but with a subtle agential nudge. It describes the function of the evaluation system. However, the word "incentives" is borrowed from economics and psychology, which apply to agents who have desires and make choices. It's a borderline case that subtly pushes the reader towards a "why" (the model wants a good score) explanation.
  • Rhetorical Impact: It casts the problem as one of market or system design. This suggests a policy-level fix is needed, framing OpenAI as a responsible actor trying to fix a flawed system that "motivates" AIs to behave badly.

3. Quote: "If it guesses 'September 10,' it has a 1-in-365 chance of being right. Saying 'I don’t know' guarantees zero points."

  • Explanation Types: Empirical ("Cites patterns or statistical norms").
  • Analysis (Why vs. How): This is a clear how explanation. It uses statistics and probabilities to explain the mechanics of the scoring system. It avoids attributing intent, focusing purely on the mathematical outcomes of different output strategies. The word "guesses" is still there, but the context is purely mathematical.
  • Rhetorical Impact: This builds credibility by using a simple, logical example. It demystifies the problem, showing the audience the cold math behind why a system optimized for accuracy might produce more errors. It serves as a moment of mechanistic clarity.
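
The arithmetic behind this passage can be spelled out. Under an accuracy-only metric that awards one point for a correct answer and zero otherwise, the expected scores work out as in this small worked example (the calculation is added here, not taken from the original post):

```python
# Expected score under an accuracy-only metric, using the birthday example above.
p_correct = 1 / 365                    # chance a blind guess matches the true date
expected_score_guess = p_correct * 1   # one point if right, zero if wrong
expected_score_abstain = 0.0           # "I don't know" always scores zero

print(f"guess:   {expected_score_guess:.4f}")    # ~0.0027
print(f"abstain: {expected_score_abstain:.4f}")  # 0.0000
```

Because any positive expected value beats zero, optimizing against such a metric favors producing some answer over abstaining. That is the "cold math" the passage relies on.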

4. Quote: "...strategically guessing when uncertain improves accuracy but increases errors and hallucinations."

  • Explanation Types: Reason-Based ("Explains using rationales or justifications").
  • Analysis (Why vs. How): This is a powerful shift to why. The word "strategically" explicitly attributes a rationale and a plan to the model's behavior. This is no longer just a tendency ("dispositional") but a choice made for a reason. It frames the model as an agent pursuing the goal of higher accuracy.
  • Rhetorical Impact: This solidifies the "AI as student" metaphor. It implies the model has agency and is making a trade-off. This can make the model seem more intelligent and capable than it is, potentially increasing both awe and fear.

5. Quote: "The difference has to do with what kinds of patterns there are in the data."

  • Explanation Types: Theoretical ("Embeds behavior in a larger framework").
  • Analysis (Why vs. How): This is a strong how explanation. It grounds the model's behavior in the fundamental theory of its operation: it is a pattern-matching system. This is a purely mechanistic explanation.
  • Rhetorical Impact: It demystifies the AI. It reassures the reader that these errors are not magical or "mysterious" but are predictable consequences of the system's design and the nature of its data. This fosters a more accurate, less sensationalized understanding.

6. Quote: "Spelling and parentheses follow consistent patterns, so errors there disappear with scale. But arbitrary low-frequency facts...cannot be predicted from patterns alone and hence lead to hallucinations."

  • Explanation Types: Theoretical and Genetic ("Traces development or origin").
  • Analysis (Why vs. How): This is the most mechanistic (how) explanation in the text. It uses a theoretical framework (pattern consistency) and a genetic lens (what happens during pretraining) to explain the origin of errors. It completely avoids agential language.
  • Rhetorical Impact: This passage positions the authors as experts who truly understand the mechanics. It builds authority and trust by providing a clear, logical, and non-anthropomorphic explanation for a complex phenomenon.

7. Quote: "...models will keep learning to guess."

  • Explanation Types: Dispositional ("Attributes tendencies or habits").
  • Analysis (Why vs. How): This explanation operates as a why. It attributes a future behavior ("learning to guess") to the model as an ongoing disposition. It suggests the model itself is the actor that will continue this behavior, rather than framing it as "the training process will continue to optimize for this outcome."
  • Rhetorical Impact: It creates a sense of autonomous, almost stubborn behavior in the model. This makes the problem seem more urgent, as if we need to intervene to stop these agents from developing bad habits on their own.

8. Quote: "...a small model which knows no Māori can simply say 'I don’t know' whereas a model that knows some Māori has to determine its confidence."

  • Explanation Types: Reason-Based ("Explains using rationales or justifications") and Dispositional.
  • Analysis (Why vs. How): This explanation slips deeply into why. It presents the models as agents making decisions based on their "knowledge." The phrase "has to determine its confidence" is a direct mapping of a human reasoning process onto the model. A mechanistic how explanation would be: "A model with no Māori data in its training set has no high-probability path to generate a response... A model with some Māori data will calculate a confidence score..."
  • Rhetorical Impact: This creates a compelling narrative of two different "minds" at work—a simple one that "knows its limits" and a more complex one that must "struggle" with its knowledge. This makes the models seem far more cognitive than they are.

9. Quote: "Finding: We understand the statistical mechanisms through which hallucinations arise and are rewarded in evaluations."

  • Explanation Types: Theoretical.
  • Analysis (Why vs. How): This is a claim to a how explanation. By saying "statistical mechanisms," the text is explicitly stating that it has a non-agential, functional understanding of the problem.
  • Rhetorical Impact: This is a move to establish epistemological authority. It reassures the audience (especially regulators and academics) that despite the widespread use of anthropomorphic metaphors, the underlying understanding at OpenAI is purely technical and scientific.

10. Quote: "...a good hallucination eval has little effect against hundreds of traditional accuracy-based evals that penalize humility and reward guessing."

  • Explanation Types: Functional.
  • Analysis (Why vs. How): This is a functional (how) explanation of the overall evaluation ecosystem, but it is saturated with agential (why) language. The evals are personified as actors that "penalize" and "reward" specific behaviors/virtues like "humility" and "guessing."
  • Rhetorical Impact: It externalizes the problem onto the broader research community. The framing suggests that OpenAI wants to build "humble" models, but the system of "hundreds of traditional evals" forces them to build models that "guess." It's a powerful rhetorical move to frame their work as a struggle against a flawed system.

Task 4: AI Literacy in Practice: Reframing Anthropomorphic Language

1. Original Quote: "...a model confidently generates an answer that isn’t true."

  • Reframed Explanation: "The model generates a high-probability text sequence that is factually incorrect and lacks phrases indicating uncertainty."

2. Original Quote: "...they are encouraged to guess rather than say 'I don’t know.'"

  • Reframed Explanation: "The model's reward function assigns a higher score for a randomly correct answer than for producing a pre-defined 'I don't know' response, thus optimizing the system to generate some output rather than none."

3. Original Quote: "...a careful model that admits uncertainty."

  • Reframed Explanation: "...a model that is calibrated to output a refusal or an expression of uncertainty when its calculated confidence score for a generated sequence is below a certain threshold."

4. Original Quote: "Abstaining is part of humility..."

  • Reframed Explanation: "A model can be programmed to abstain from answering, which serves the functional goal of avoiding errors when the system's confidence is low. This aligns with the principle of prioritizing user safety over providing all possible answers."

5. Original Quote: "It can be easier for a small model to know its limits."

  • Reframed Explanation: "A smaller model, with a less complex statistical representation of data, is more likely to lack any high-probability path for answering an obscure question, causing it to default to a canned refusal."

6. Original Quote: "...a model that knows some Māori has to determine its confidence."

  • Reframed Explanation: "A model whose training data included some Māori text will attempt to generate a response and calculate a probability score for that sequence. This requires a calibration system to interpret whether that score is high enough to be reliable."

7. Original Quote: "...models will keep learning to guess."

  • Reframed Explanation: "If training continues with accuracy-only benchmarks, the models' parameters will continue to be optimized in ways that favor generating any plausible-sounding text over refusing to answer."

8. Original Quote: "...evals that penalize humility and reward guessing."

  • Reframed Explanation: "...evals whose scoring metrics assign negative or zero value to outputs expressing uncertainty, while assigning positive value to any output that happens to be correct, regardless of the model's confidence score."
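
To show what these reframings (items 2 and 8) point at, here is a hypothetical contrast between two scoring rules: one accuracy-only, one that subtracts points for confident errors. The function names and the penalty value are assumptions for illustration, not any benchmark's actual specification.

```python
# Two hypothetical scoring rules applied to the same three outcomes.
def accuracy_only(correct: bool, abstained: bool) -> float:
    # Only a correct, non-abstaining answer earns credit; errors and abstentions both score zero.
    return 1.0 if (correct and not abstained) else 0.0

def penalize_errors(correct: bool, abstained: bool, penalty: float = 0.5) -> float:
    # Abstention scores zero, a correct answer scores one, a confident error loses points.
    if abstained:
        return 0.0
    return 1.0 if correct else -penalty

outcomes = [("correct answer", True, False),
            ("wrong guess", False, False),
            ("abstention", False, True)]
for label, correct, abstained in outcomes:
    print(f"{label:14s}  accuracy-only: {accuracy_only(correct, abstained):5.2f}  "
          f"with penalty: {penalize_errors(correct, abstained):5.2f}")
```

Under the first rule a wrong guess costs nothing relative to abstaining, so maximizing the metric favors always answering; under the second, abstention outranks a low-confidence guess. The difference lies in the scoring function, not in any "humility" the model possesses.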

Critical Observations

  • Agency Slippage: The text masterfully shifts between mechanistic and agential explanations. When establishing technical credibility (explaining the origin of errors from next-word prediction), it uses precise, how-based language ("statistical mechanisms," "patterns in the data"). When explaining the problem and justifying its approach, it shifts to agential, why-based language ("models learning to guess," "penalizing humility"). This rhetorical slippage allows it to appear technically rigorous while making its arguments relatable and persuasive through the illusion of mind.
  • Metaphor-Driven Trust: The interconnected metaphors of the Model as a Student (being tested, guessing, learning) and the Model as a Virtuous Agent (possessing humility, honesty) are central to building trust. They frame AI failures not as inherent design flaws but as corrigible "bad habits" or a lack of "character" that can be fixed with better "teaching" (incentives). This makes the problem seem manageable and aligns the AI's development with a human moral framework.
  • Obscured Mechanics: The most significant obscured mechanic is the nature of the error itself. The term "hallucination" conceals that these are not random, mysterious glitches but the direct, logical output of a system designed to generate statistically plausible text. It is doing exactly what it was optimized to do. The language of "guessing" and "knowing" similarly hides the underlying processes of probability calculation and pattern reconstruction from learned weights.
  • Context Sensitivity: The use of metaphor is highly context-sensitive. The most vivid anthropomorphism is used when discussing the evaluation process, framing the problem in a relatable narrative of flawed incentives. The least anthropomorphism is used when discussing the pretraining process, where the goal is to demonstrate a deep, mechanistic understanding of the technology's foundations.

Conclusion

The primary anthropomorphic patterns in the text construct a coherent narrative of the AI as a cognitive and moral agent-in-training. This is achieved through the central metaphors of the Model as Student and the Model as Mind. The language systematically maps human psychological states (confidence, uncertainty), cognitive strategies (guessing, reasoning), and even moral virtues (humility, honesty) onto the purely mathematical and statistical processes of a large language model. This creates a powerful illusion of mind, leading the reader to understand the AI not as a complex artifact, but as a simplified, non-biological agent that learns, makes choices, and has dispositions.

This constructed illusion has profound implications for AI literacy. It makes an impossibly complex technology feel intuitive, but at the cost of accuracy. By framing the problem of factual errors as "hallucinations" that can be solved by teaching models "humility," the discourse obscures the fundamental, mechanistic nature of the system. It encourages us to ask "Why did the AI lie?" instead of the more accurate question: "What statistical properties of the data and model architecture led to the generation of this incorrect token sequence?"

As demonstrated in the reframing examples, a more precise and responsible way to communicate about these systems involves a deliberate shift in language. Instead of attributing mental states, communicators can focus on the underlying mechanisms: probabilities, thresholds, data patterns, and optimization functions. The key principle is to actively delineate between an observed system behavior (e.g., the output of a declarative, factually incorrect sentence) and an attributed internal state (e.g., being "confident but wrong"). By maintaining this distinction, we can foster a public understanding of AI that is grounded in its reality as a powerful artifact, not in the fantasy of it being a nascent agent.

License

License: Discourse Depot © 2025 by TD is licensed under CC BY-NC-SA 4.0