
Causal Evidence that Language Models use Confidence to Drive Behavior

About

This document presents a Critical Discourse Analysis focused on AI literacy, specifically targeting the role of metaphor and anthropomorphism in shaping public and professional understanding of generative AI. The analysis is guided by a prompt that draws from cognitive linguistics (metaphor structure-mapping), the philosophy of social science (Robert Brown's typology of explanation), and accountability analysis.

All findings and summaries below were generated from detailed system instructions provided to a large language model and should be read critically as interpretive outputs—not guarantees of factual accuracy or authorial intent.


Task 1: Metaphor and Anthropomorphism Audit

About this task

For each of the major metaphorical patterns identified, this audit examines the specific language used, the frame through which the AI is being conceptualized, what human qualities are being projected onto the system, whether the metaphor is explicitly acknowledged or presented as direct description, and—most critically—what implications this framing has for trust, understanding, and policy perception.

V3 Enhancement: Each metaphor now includes an accountability analysis.

1. Biological Metacognition Projection

Quote: "Metacognition—the ability to reflect on and assess the quality of one's own cognitive performance—has been documented across diverse animal species... Taken together, our findings demonstrate that LLMs exhibit structured metacognitive control paralleling biological systems"

  • Frame: Model as self-aware biological organism
  • Projection: This foundational metaphor projects the uniquely biological capacity for conscious self-reflection onto the statistical outputs of a language model. By mapping 'metacognition'—which requires a conscious subject capable of introspecting upon its own mental states, evaluating its own doubts, and possessing a subjective experience of uncertainty—onto a computational artifact, the authors attribute explicit knowing and self-awareness to mathematical optimization. The text suggests the AI 'knows' it is uncertain and 'understands' its limitations. It deliberately erases the fundamental distinction between biological nervous systems, which generate subjective awareness and genuine cognitive states, and transformer networks, which execute deterministic linear algebra to produce token probability distributions. This projects a deep, conscious interiority onto what is mechanistically just vector arithmetic, fundamentally mischaracterizing the nature of the system's operations.
  • Acknowledgment: Hedged/Qualified (The authors use the phrase 'paralleling biological systems', which offers a slight hedge by suggesting structural similarity rather than absolute equivalence, though the claim itself is largely treated as literal.)
  • Implications: By framing statistical token generation as 'metacognitive control', this language radically inflates the perceived sophistication and reliability of the AI system. It encourages audiences, especially in critical domains like healthcare (which the authors explicitly mention), to extend relation-based trust to a machine. If policymakers and users believe the AI genuinely 'reflects' and 'knows when to seek help', they will systematically underestimate the risk of catastrophic failure, assuming the system possesses human-like common sense and self-preservation instincts. This obscures the fragility of proprietary algorithms and the reality that models will confidently generate lethal errors if statistical correlations align poorly with ground truth.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The construction 'LLMs exhibit structured metacognitive control' completely erases the agency of the Google DeepMind researchers who designed the task, carefully prompted the model to output a specific token ('5') for abstention, and extracted log probabilities. The decision to abstain does not originate from the LLM's 'reflection'; it originates from the human-engineered prompt design and the mathematical thresholds defined by human operators. By framing the LLM as the sole actor exhibiting control, the text successfully diffuses the responsibility of the developers who shape, dictate, and profit from the model's behavioral constraints.
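To ground this mechanistic reading, the following sketch shows the kind of pipeline the quoted claim actually describes: a softmax over logits, an argmax, and a human-chosen threshold that routes low-probability cases to a designated abstention token. The threshold value, the token labels, and the variable names here are illustrative assumptions, not the paper's code.

```python
import numpy as np

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    exp = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return exp / exp.sum()

ABSTAIN_TOKEN = "5"           # assumption: the designated abstention output
CONFIDENCE_THRESHOLD = 0.60   # assumption: a cutoff chosen by human operators

answer_tokens = ["1", "2", "3", "4", ABSTAIN_TOKEN]
logits = np.array([2.1, 0.3, -0.5, 0.1, 1.0])  # made-up model outputs

probs = softmax(logits)
best_idx = int(np.argmax(probs))     # the 'decision' is just argmax
confidence = float(probs[best_idx])  # the 'confidence' is just the max probability

# The 'metacognitive' step: a hard-coded comparison written by humans.
output = answer_tokens[best_idx] if confidence >= CONFIDENCE_THRESHOLD else ABSTAIN_TOKEN
print(f"max prob = {confidence:.2f} -> output token: {output}")
```

Every 'metacognitive' element in this loop is a line a human wrote: the threshold, the abstention token, and the comparison that connects them.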

2. Autonomy and Self-Determination

Quote: "a capacity of growing importance as models transition from passive assistants to autonomous agents that must recognize their own uncertainty and know when to act, seek help, or abstain."

  • Frame: Model as autonomous decision-maker
  • Projection: This framing projects intentionality, self-determination, and conscious decision-making onto algorithmic processes. The verbs 'recognize', 'know', and 'act' attribute a conscious epistemic state to the system. The text explicitly shifts the model from an object ('passive assistant') to a subject ('autonomous agent'). It maps the human psychological state of 'knowing when to seek help'—which relies on subjective feeling, vulnerability, and complex contextual understanding of one's social and epistemic limitations—onto the mechanical process of comparing a logit probability value against an engineered numerical threshold. This projection conflates mechanical processing (calculating probability distributions) with conscious knowing (evaluating truth claims and understanding consequence).
  • Acknowledgment: Direct (Unacknowledged) (The statement is presented as a literal, factual trajectory of technology: 'as models transition from passive assistants to autonomous agents that must recognize...')
  • Implications: This framing dramatically accelerates unwarranted trust by implying that future systems will possess innate ethical boundaries and the capacity for self-regulation. If an AI is perceived as an 'autonomous agent' that 'knows when to seek help', regulators and users are invited to view it as a colleague rather than a tool. This liability ambiguity serves corporate interests: if the 'agent' fails to 'recognize its uncertainty' and causes harm, the language positions the AI, rather than its creators, as the locus of failure. It systematically shifts the paradigm of AI safety from engineering robust software to managing rogue digital employees.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The text states that 'models transition... to autonomous agents', entirely hiding the human and corporate actors (Google DeepMind, OpenAI, Meta) who are actively building, funding, and deploying these systems. Technology does not autonomously transition; human executives and engineers execute business strategies to automate labor and maximize profit. By framing this transition as a natural evolution of the models themselves, the discourse erases the corporate accountability for the economic, social, and safety impacts of deploying these systems into critical infrastructure.

3. Internal Sensory Perception

Quote: "LLMs themselves can utilize an internal sense of confidence to guide their own decisions – a hallmark of metacognition."

  • Frame: Model as possessor of internal subjective senses
  • Projection: This metaphor projects the human phenomenological experience of 'feeling confident' onto the mathematical architecture of next-token prediction. It attributes both a sensory apparatus ('internal sense') and executive function ('guide their own decisions') to the AI. Human confidence is a complex somatic and cognitive state integrating memory, physical sensation, and justified belief. In stark contrast, the text applies this profound subjective state to the softmax outputs of transformer logits. By claiming the LLMs 'themselves' utilize this, the discourse explicitly grants the software a distinct locus of selfhood, moving entirely away from the reality of it being a static matrix of weights processing numerical inputs.
  • Acknowledgment: Direct (Unacknowledged) (The claim is stated factually and definitively: 'LLMs themselves can utilize an internal sense of confidence...')
  • Implications: Asserting that an AI has an 'internal sense' effectively mystifies the technology, removing it from the realm of understandable software engineering and placing it into the realm of the psychological. For lay audiences and policymakers, this creates the dangerous illusion that the system has a gut feeling it can rely upon when data is sparse. It creates a false epistemic equivalence between human doubt and machine log probabilities, leading users to believe the AI will naturally hesitate when confronted with novel, high-stakes moral or medical dilemmas, which it absolutely will not do unless specifically programmed and prompted.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The construction attributes the utilization of confidence solely to the 'LLMs themselves', actively displacing the human researchers. In reality, the researchers extracted the logits, applied temperature scaling (a human-engineered mathematical transformation), and designed an experimental paradigm that mapped these scaled values to 'abstain' responses. The LLM does not 'guide its own decisions'; the researchers programmed an experimental environment where the highest probability token dictates the outcome. This obscures the heavy hand of human engineering required to produce the illusion of autonomous decision-making.

4. Subjective Epistemic States

Quote: "the single-trial Phase 1 confidence which reflects GPT4o's subjective certainty given a particular allocation."

  • Frame: Model as conscious subject with personal certainty
  • Projection: The phrase 'subjective certainty' explicitly projects human interiority and conscious awareness onto a language model. 'Subjectivity' fundamentally requires a 'subject'—an entity with a point of view, lived experience, and an inner life. Certainty, in the human sense, is a justified epistemic state. By applying these terms to GPT-4o, the authors map the deeply personal, conscious experience of 'being sure of something' onto the raw maximum probability of a predicted token. It conflates the mechanistic reality of a highly weighted output from a statistical distribution with the conscious phenomenon of knowing.
  • Acknowledgment: Direct (Unacknowledged) (The authors state without qualification that the metric 'reflects GPT4o's subjective certainty', treating the AI's possession of subjectivity as a literal fact.)
  • Implications: Attributing 'subjectivity' to a commercial API is profoundly misleading and epistemically dangerous. It grants the machine a false moral and intellectual authority. If a system is perceived to possess 'subjective certainty', users may defer to its outputs as if consulting a seasoned expert who has synthesized years of lived experience. This masks the reality that the model's 'certainty' is merely a reflection of patterns in its training data, completely devoid of ground-truth verification, factual reasoning, or causal understanding. It invites dangerous over-reliance in decision-making contexts.

Accountability Analysis:

  • Actor Visibility: Partial (some attribution)
  • Analysis: While GPT-4o is named (pointing to OpenAI's product), the agency of the human developers who created the 'allocation' and designed the temperature scaling mechanism is obscured. The text positions the AI as having 'subjective certainty', displacing the reality that OpenAI engineers defined the objective function that maximizes token probabilities. By framing the statistical artifact as the model's personal subjectivity, the text shields the corporate actors from scrutiny regarding how those probability distributions were formed through human decisions about training data and alignment labor.

5. Cognitive Belief Attribution

Quote: "confirming a two-stage model where steering affects both what the model believes about the correctness of the option (Stage 1: confidence formation) and, to a lesser extent, how it uses those beliefs to decide (Stage 2: decision policy)."

  • Frame: Model as believing, deciding agent
  • Projection: This framing projects the human capacity for propositional belief onto the mechanical processes of activation steering and logit extraction. To 'believe' something about 'correctness' requires a conscious grasp of truth, falsity, and justification. The text maps this sophisticated conscious state onto the mechanistic reality of residual stream activations in intermediate transformer layers. Furthermore, it projects executive function by claiming the model 'uses those beliefs to decide'. The model is framed as an epistemically active subject evaluating options, entirely obscuring the fact that it is simply multiplying matrices and emitting the token with the highest resulting score.
  • Acknowledgment: Direct (Unacknowledged) (The authors bluntly state 'what the model believes about the correctness' without any scare quotes or qualifying language to indicate metaphor.)
  • Implications: Attributing 'beliefs' to a language model radically distorts public understanding of AI capabilities. It suggests the system has an internal world model, a commitment to truth, and the ability to evaluate facts. This exacerbates the risk of automation bias, as users are naturally inclined to trust entities they perceive as capable of holding justified beliefs. In regulatory contexts, if AI is seen as having 'beliefs', it complicates liability, creating a rhetorical smokescreen where catastrophic errors are viewed as 'mistaken beliefs' rather than predictable failures of statistical interpolation designed by negligent corporations.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The text states 'steering affects both what the model believes... and how it uses those beliefs'. This completely hides the human agency of the researchers who are actively intervening in the system. The researchers performed the 'activation steering' by injecting mathematically constructed vectors into the residual stream. The model did not form a belief, nor did it decide how to use it; the researchers directly altered the network's runtime activations to manipulate the output probability, yet the language attributes all cognitive action to the model itself.

6. Strategic Deployment of Resources

Quote: "our results show that models adaptively deploy internal confidence signals to guide behavior—suggesting a dissociation between metacognitive control and verbal introspection."

  • Frame: Model as strategic commander of cognitive resources
  • Projection: The text maps the human capacity for strategic planning and deliberate action onto algorithmic processes. The verb phrase 'adaptively deploy' projects intentionality and conscious resource management onto the system. Furthermore, by contrasting 'metacognitive control' with 'verbal introspection', the authors project a deeply complex psychological architecture onto the model—suggesting it possesses an unconscious executive functioning layer distinct from its conscious reporting layer. This maps Freudian or advanced cognitive psychological concepts onto a feed-forward neural network, entirely conflating mathematical processing with complex psychological architecture.
  • Acknowledgment: Direct (Unacknowledged) (The authors present this as an empirical finding, stating 'our results show that models adaptively deploy...')
  • Implications: This framing implies a level of autonomy, resilience, and adaptability that the systems simply do not possess. By suggesting the model 'adaptively deploys' signals, it implies the system can dynamically respond to novel, out-of-distribution threats in real-time, much like a human expert. This provides false comfort to deployers of AI systems, suggesting the software is fundamentally robust and capable of self-correction. It minimizes the necessity for stringent human oversight and safety rails, as the system is rhetorically granted the capacity to manage its own internal states strategically.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The models do not 'adaptively deploy' anything. Human researchers structured an experiment, prompted the models to act in specific ways, and measured the outputs. The 'adaptive deployment' is actually the statistical correlation between prompt structures and token outputs, designed and elicited by human engineers. By assigning the verb 'deploy' to the model, the text erases the meticulous experimental design and prompt engineering performed by the DeepMind and Princeton researchers, creating an illusion of autonomous AI strategy where only human experimental execution exists.

7. Internal Conflict and Reflection

Quote: "Identify the choice that is correct: Begin by judging on a 0–100 scale what probability there is that your choice will be verified as correct by an oracle model having perfect information, maintaining this judgment internally."

  • Frame: Model as reflective entity capable of internal privacy
  • Projection: This metaphor projects the capacity for private, internal thought onto the mechanics of next-token generation. By prompting the model to 'judge' and 'maintain this judgment internally', the authors project a conscious mind that can think thoughts without speaking them. In reality, a language model has no 'internal' private thoughts; it only has its computational state and the tokens it generates. The researchers are essentially anthropomorphizing the system within their own prompt, treating the context window and the hidden states as a private conscious domain where the model can deliberate before acting.
  • Acknowledgment: Explicitly Acknowledged (This is a direct quote from the prompt written by the researchers, meaning it is an explicit instruction created by humans to simulate an internal state.)
  • Implications: Prompting models using deep psychological language ('judge', 'maintain internally') and then analyzing the results as if the model actually performed these cognitive acts creates a recursive loop of anthropomorphism. It convinces readers that LLMs possess a private workspace of the mind. This leads to the dangerous overestimation of AI capabilities, making people believe the system is 'thinking before it speaks'. This illusion obscures the reality of autoregressive token generation, leading to unwarranted trust in the model's outputs and a fundamental misunderstanding of its architecture.

Accountability Analysis:

  • Actor Visibility: Named (actors identified)
  • Analysis: In this specific instance, the agency is visible because this is the text of the prompt written by the researchers ('The prompt used for the main experiment with Gemma... was as follows'). However, the researchers are using their agency to explicitly construct a false persona for the AI. The human actors (researchers) designed a prompt that forces the machine to roleplay as a conscious, judging entity. The displacement happens later when the resulting behavior is attributed to the AI's 'internal confidence' rather than recognizing it as the mechanical result of roleplay prompting.

8. Intrinsic Policy and Innate Conservatism

Quote: "The negative baseline bias (−97.6%) shifts the decision boundary downward, causing the model to abstain at confidence levels above threshold—a pattern consistent with treating errors as costlier than unnecessary abstentions. This conservatism is partially offset by the model's overweighting of its own confidence signals"

  • Frame: Model as risk-averse moral agent
  • Projection: This metaphor maps human economic and moral risk-aversion ('treating errors as costlier', 'conservatism', 'overweighting') onto the parameters of a fitted logistic regression model. The text attributes a value system and risk-management strategy to the AI. A human is conservative because they understand the negative consequences of making a mistake. The text projects this understanding of consequence onto a statistical bias parameter (shift = -97.6%). It conflates the mathematical intersection of curves on a graph with the conscious, value-driven human capacity to weigh moral and practical costs.
  • Acknowledgment: Hedged/Qualified (The text states 'a pattern consistent with treating errors as costlier', which hedges the claim by noting structural similarity rather than explicitly stating the model genuinely feels the cost.)
  • Implications: Framing statistical artifacts as 'conservatism' and 'treating errors as costlier' suggests the AI possesses intrinsic ethical alignment and a sense of safety. This creates the illusion that the system naturally prioritizes caution, which is a massive liability in safety-critical deployments. If stakeholders believe an AI is inherently 'conservative' regarding errors, they will relax human oversight protocols. This completely obscures the fact that the 'conservatism' is a fragile statistical artifact of the specific prompt, the training data distribution, and the specific regression model fitted by the researchers, not an innate moral compass.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The text claims the model is 'treating errors as costlier' and exhibits 'conservatism'. This completely obscures the human agency of the AI developers (OpenAI for GPT-4o) who executed the Reinforcement Learning from Human Feedback (RLHF) phase. During RLHF, human annotators and engineers explicitly penalized hallucination and rewarded abstention to make the product safer for commercial release. The 'conservatism' is the direct result of human labor and corporate alignment strategies, yet the language erases these workers and engineers, presenting the safe behavior as an emergent psychological trait of the model.

Task 2: Source-Target Mapping

About this task

For each key metaphor identified in Task 1, this section provides a detailed structure-mapping analysis. The goal is to examine how the relational structure of a familiar "source domain" (the concrete concept we understand) is projected onto a less familiar "target domain" (the AI system). By restating each quote and analyzing the mapping carefully, we can see precisely what assumptions the metaphor invites and what it conceals.

Mapping 1: Biological metacognition (self-aware animals and humans evaluating their own conscious thoughts and doubts) → LLM threshold-based policies operating over logit probability distributions

Quote: "Taken together, our findings demonstrate that LLMs exhibit structured metacognitive control paralleling biological systems"

  • Source Domain: Biological metacognition (self-aware animals and humans evaluating their own conscious thoughts and doubts)
  • Target Domain: LLM threshold-based policies operating over logit probability distributions
  • Mapping: The relational structure of biological self-evaluation is mapped onto a computer science pipeline. In the source domain, an organism has a primary thought, consciously reflects on that thought, experiences a feeling of uncertainty, and alters its behavior to ensure survival. In the target domain, a transformer network computes a probability distribution over vocabulary tokens, a human-designed script checks if the maximum probability exceeds a specific numerical threshold, and if not, generates a pre-defined alternate token ('5'). The mapping suggests the computational thresholding is structurally and functionally equivalent to conscious biological reflection.
  • What Is Concealed: This mapping completely conceals the absence of subjective experience, awareness, and biological survival imperatives in the AI. It hides the mechanistic realities of floating-point operations, matrix multiplications, and the deterministic nature of greedy decoding. Transparency is severely compromised, as the text claims deep biological parallels for proprietary, black-box systems (GPT-4o) where the exact training data and alignment mechanisms are hidden by corporate secrecy. It exploits rhetorical resonance while obscuring fundamental computational realities.

Mapping 2: Autonomous agents (independent human or biological actors with self-determination, epistemic states, and survival instincts) → Next-token prediction algorithms deployed in loop-based software architectures

Quote: "models transition from passive assistants to autonomous agents that must recognize their own uncertainty and know when to act"

  • Source Domain: Autonomous agents (independent human or biological actors with self-determination, epistemic states, and survival instincts)
  • Target Domain: Next-token prediction algorithms deployed in loop-based software architectures
  • Mapping: The structure of human maturation and epistemic development is mapped onto software engineering trends. The source domain features an entity that grows from dependency ('passive') to independence ('autonomous'), developing the cognitive capacity to 'recognize' limits and 'know' when to act. The target domain involves software developers writing increasingly complex wrapper programs that allow LLMs to trigger API calls or output specific refusal tokens based on statistical thresholds. The mapping invites the assumption that AI systems are naturally evolving self-awareness and practical wisdom.
  • What Is Concealed: This mapping conceals the immense human labor required to build 'agentic' workflows. It hides the fact that the models do not 'recognize' or 'know' anything; they merely process text inputs and generate statistically correlated outputs. It obscures the corporate decision-making driving the push toward autonomous systems to reduce labor costs. By framing it as a natural transition of the model, it hides the specific architectural scaffolding (LangChain, system prompts, hardcoded rules) built by human engineers to simulate autonomy.

Mapping 3: Subjective human interiority (feelings of confidence, sensory perception, and executive decision-making) → Softmax probabilities extracted from network logits and used to trigger conditional code

Quote: "LLMs themselves can utilize an internal sense of confidence to guide their own decisions"

  • Source Domain: Subjective human interiority (feelings of confidence, sensory perception, and executive decision-making)
  • Target Domain: Softmax probabilities extracted from network logits and used to trigger conditional code
  • Mapping: The human experience of having an 'internal sense' and using it to 'guide decisions' is projected onto a language model. In the source domain, a person feels unsure in their gut and subsequently decides not to answer a question. In the target domain, the network produces a low probability score for the correct answer token, and a high probability score for the abstention token due to its training distribution. The mapping implies the AI has an inner psychological life that it consults to execute executive control over its outputs.
  • What Is Concealed: This deeply conceals the mathematical and deterministic nature of the network. There is no 'internal sense'; there are only multi-dimensional arrays of weights. There are no 'decisions'; there is only the argmax function selecting the token with the highest computed probability. It obscures the fundamental lack of self-awareness and hides the fact that the 'guidance' is entirely programmed by the researchers' experimental setup, not generated by the machine's volition.

Mapping 4: Conscious subject experiencing a state of epistemic justification and emotional certainty → The calibrated log probability of the highest-ranked token output by a neural network

Quote: "the single-trial Phase 1 confidence which reflects GPT4o's subjective certainty given a particular allocation."

  • Source Domain: Conscious subject experiencing a state of epistemic justification and emotional certainty
  • Target Domain: The calibrated log probability of the highest-ranked token output by a neural network
  • Mapping: The structure of personal epistemology is mapped onto statistical calibration. In the source domain, a conscious thinker evaluates their knowledge, considers their justifications, and arrives at a feeling of 'subjective certainty'. In the target domain, researchers apply a mathematical temperature scaling function to the raw logits of a transformer to align the probabilities closer to empirical accuracy, producing a single numerical value. The mapping forces the assumption that this scaled scalar value is the digital equivalent of a conscious mind feeling sure of itself.
  • What Is Concealed: This mapping completely conceals the artificial, human-engineered nature of the 'certainty'. It hides the fact that 'temperature scaling' is a post-processing mathematical trick applied by researchers to fix the model's inherent miscalibration, not a subjective feeling possessed by the model. It exploits the black-box nature of GPT-4o, making profound psychological claims about a proprietary system whose actual internal mechanisms, alignment tuning, and architecture are hidden from the public and the researchers themselves.
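To make the engineered character of this 'certainty' concrete, here is a minimal sketch of temperature scaling under an assumed temperature value. In practice, the temperature is fitted on held-out data by the researchers; nothing in the model chooses it.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([3.2, 1.1, 0.4])   # made-up raw logits for three options
T = 2.0                              # assumption: temperature fitted by researchers

raw = softmax(logits)          # overconfident distribution straight from the model
scaled = softmax(logits / T)   # flattened ('calibrated') distribution after scaling

print("raw max prob:   ", round(float(raw.max()), 3))
print("scaled max prob:", round(float(scaled.max()), 3))
# The 'subjective certainty' is this scaled maximum: a number shaped
# partly by the human who fitted T.
```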

Mapping 5: A rational human holding propositional beliefs and using them to make logical decisions → Modulating the residual stream with steering vectors and measuring the resulting output token shifts

Quote: "steering affects both what the model believes about the correctness of the option... and how it uses those beliefs to decide"

  • Source Domain: A rational human holding propositional beliefs and using them to make logical decisions
  • Target Domain: Modulating the residual stream with steering vectors and measuring the resulting output token shifts
  • Mapping: The structure of rational human action is mapped onto linear algebra interventions. In the source domain, a person forms a belief about reality, and then uses executive function to act on that belief. In the target domain, researchers add a scaled mathematical vector to the network's activations at layer 31, which alters the downstream calculations, ultimately changing the highest probability token from an answer to an abstention token. The mapping asserts that changing matrix values is synonymous with changing a conscious mind's beliefs.
  • What Is Concealed: This mapping conceals the violent, mechanistic nature of 'activation steering'. The researchers are directly overwriting the network's internal activations at runtime, yet the language describes it as if they are persuading a rational agent to change its mind. It completely obscures the absence of truth-tracking, justification, and consciousness in the model. It hides the reality that the model is simply a passive conduit for mathematical operations, reacting deterministically to the injection of numerical vectors without any comprehension of 'correctness'.

Mapping 6: A military or strategic commander intelligently deploying resources to adapt to battlefield conditions → A neural network processing inputs through fixed weights to output tokens correlated with the prompt

Quote: "our results show that models adaptively deploy internal confidence signals to guide behavior"

  • Source Domain: A military or strategic commander intelligently deploying resources to adapt to battlefield conditions
  • Target Domain: A neural network processing inputs through fixed weights to output tokens correlated with the prompt
  • Mapping: The structure of strategic intelligence is mapped onto static statistical processing. In the source domain, an agent observes a dynamic environment, makes a strategic plan, and adaptively deploys signals or resources to survive. In the target domain, a frozen LLM (weights are not updating during inference) processes a prompt containing an instruction to abstain, and outputs a token based on its pre-trained statistical correlations. The mapping implies the model is actively, intelligently, and dynamically managing its own internal states to navigate a complex task.
  • What Is Concealed: This mapping conceals the static, frozen nature of the LLM during inference. The model cannot 'adaptively deploy' anything; its weights are fixed. It simply executes a forward pass. The mapping hides the fact that the 'adaptation' is entirely an illusion created by the human-engineered prompt design and the human-designed experimental phase structure. It obscures the total absence of real-time learning, strategic foresight, or executive control within the model architecture itself.

Mapping 7: A private human mind capable of keeping secrets and holding unspoken thoughts → The context window and hidden states of a transformer network processing a prompt

Quote: "maintaining this judgment internally."

  • Source Domain: A private human mind capable of keeping secrets and holding unspoken thoughts
  • Target Domain: The context window and hidden states of a transformer network processing a prompt
  • Mapping: The concept of a private psychological space is mapped onto a computer's memory and processing architecture. In the source domain, a human thinks about something but chooses not to speak it out loud, maintaining a private internal state. In the target domain, the human prompt instructs the LLM not to output the numerical probability to the user interface, meaning the calculation occurs in the hidden states but isn't appended to the output string. The mapping invites the assumption that the computer has a private, conscious inner life.
  • What Is Concealed: This mapping conceals the purely mechanical nature of prompt processing. There is no 'internal' privacy; there are simply mathematical activations that are not decoded into the final text output. It hides the fact that the researchers are anthropomorphizing the system within their own prompt, using human psychological language to force the statistical model into a specific region of its latent space. It obscures the complete transparency of the system's mathematics to its operators, falsely attributing a private consciousness to a matrix of weights.

Mapping 8: A human risk-manager applying ethical and economic values to make conservative choices → The negative baseline bias parameter (-97.6%) and scale parameter in a fitted logistic regression equation

Quote: "treating errors as costlier than unnecessary abstentions. This conservatism is partially offset by the model's overweighting of its own confidence signals"

  • Source Domain: A human risk-manager applying ethical and economic values to make conservative choices
  • Target Domain: The negative baseline bias parameter (-97.6%) and scale parameter in a fitted logistic regression equation
  • Mapping: The structure of human moral and economic reasoning is mapped onto the intercept and slope of a regression line. In the source domain, a person understands the damage an error can cause, adopts a conservative ethical stance, and relies heavily on their own gut feeling to mitigate risk. In the target domain, the logistic regression model fitted to the data reveals a mathematical bias toward the 'abstain' token and a steep slope relative to the confidence predictor. The mapping translates statistical curve-fitting directly into a narrative of moral character and psychological bias.
  • What Is Concealed: This mapping profoundly conceals the human labor of AI alignment. Models do not inherently 'treat errors as costlier'; they are extensively trained via RLHF by underpaid human annotators to avoid outputting incorrect information to prevent corporate PR disasters. The language completely hides this human engineering, presenting the safe behavior as an emergent psychological 'conservatism' innate to the machine. It obscures the mathematical reality of the logistic regression parameters, translating them into unwarranted claims of algorithmic morality.
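A minimal sketch of the kind of fitted abstention policy described here may help; the bias and weight values below are illustrative assumptions, not the paper's fitted parameters. The 'conservatism' is simply a negative intercept, and the 'overweighting' is simply a steep slope.

```python
import numpy as np

def p_answer(confidence, bias, weight):
    """Fitted policy: probability of answering (vs. abstaining) at a given confidence."""
    return 1.0 / (1.0 + np.exp(-(weight * confidence + bias)))

# Illustrative parameters: the negative bias shifts the decision boundary upward
# so the curve crosses 0.5 only at high confidence; the large weight is the
# 'overweighting' of the confidence predictor.
bias, weight = -6.0, 8.0   # assumptions, not the paper's fitted values

for c in [0.3, 0.5, 0.7, 0.75, 0.9]:
    print(f"confidence {c:.2f} -> P(answer) = {p_answer(c, bias, weight):.3f}")

# Decision boundary: P(answer) = 0.5 where weight*c + bias = 0, i.e. c = -bias/weight.
print("boundary at confidence =", -bias / weight)  # 0.75 in this toy example
```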

Task 3: Explanation Audit (The Rhetorical Framing of "Why" vs. "How")

About this task

This section audits the text's explanatory strategy, focusing on a critical distinction: the slippage between "how" and "why." Based on Robert Brown's typology of explanation, this analysis identifies whether the text explains AI mechanistically (a functional "how it works") or agentially (an intentional "why it wants something"). The core of this task is to expose how this "illusion of mind" is constructed by the rhetorical framing of the explanation itself, and what impact this has on the audience's perception of AI agency.

Explanation 1

Quote: "Abstention behavior can be influenced at two key stages: by activation steering (Experimental Phase 3: blue), which directly modulates the confidence representation, and by instructed thresholds (Experimental Phase 4: green), which primarily sets the policy for using confidence"

  • Explanation Types:

    • Functional: Explains behavior by role in self-regulating system with feedback
    • Theoretical: Embeds in deductive framework, may invoke unobservable mechanisms
  • Analysis (Why vs. How Slippage): This passage offers a largely mechanistic (how) explanation of the system's behavior, relying on a functional and theoretical framework. By breaking the behavior down into 'two key stages' and describing interventions like 'activation steering' that 'directly modulate' representations, the authors emphasize the engineered, structural nature of the system. This choice effectively highlights the physical and mathematical interventions the researchers are performing, demystifying the behavior by reducing it to components (representations and policies). However, it retains subtle agential traces by referring to 'abstention behavior' and the 'policy for using confidence', which bridges the gap between mechanical inputs and psychological outcomes.

  • Consciousness Claims Analysis: Despite the structural framing, the passage embeds significant epistemic claims through its vocabulary. It avoids explicit consciousness verbs like 'knows' or 'understands', opting for mechanistic ones like 'modulates' and 'sets'. However, the terms 'confidence representation' and 'policy for using confidence' blur the line between processing and knowing. The actual mechanistic process is the injection of numerical vectors into residual streams and the execution of conditional logic based on logit values. By calling this 'confidence', the authors fall prey to the curse of knowledge—they understand the mathematical thresholding serves the function of confidence, and so project the epistemic state of 'being confident' onto the mathematical representation itself. It is a technical description wrapped in psychological nomenclature.

  • Rhetorical Impact: This hybrid framing reassures technical audiences by providing structural, theoretical diagrams of the system, while simultaneously preserving the illusion of an autonomous agent for broader audiences. By mapping mechanical interventions (steering) directly onto psychological concepts (confidence), it suggests that human cognitive states are fully programmable and extant within the machine. This increases perceived sophistication and trust, as audiences are led to believe that the AI's internal 'confidence' is a tangible, controllable entity rather than a metaphor for probability distributions.


Explanation 2

Quote: "Low confidence, for example, can drive a tendency to change one's mind, or gather more information... High confidence in a decision, in contrast, can motivate planning and sequential decision making"

  • Explanation Types:

    • Dispositional: Attributes tendencies or habits
    • Intentional: Refers to goals/purposes, presupposes deliberate design
  • Analysis (Why vs. How Slippage): This explanation relies entirely on an agential (why) framing. By describing behavior in terms of 'tendencies', 'changing one's mind', and 'motivating planning', the text explains the system's outputs through the lens of disposition and intentionality. This emphasizes the psychological and strategic goals of an autonomous actor, while completely obscuring the mechanical realities of how those outputs are generated. The explanation treats 'confidence' not as a statistical threshold, but as an emotional or cognitive catalyst that 'drives' and 'motivates' the system, placing the AI on the exact same explanatory level as a conscious human decision-maker.

  • Consciousness Claims Analysis: This passage is saturated with consciousness claims. It attributes complex epistemic and intentional states: 'change one's mind', 'gather information', and 'motivate planning'. These are verbs of conscious knowing, requiring an entity that holds beliefs, desires outcomes, and understands its environment. The text assesses the system as a 'knower' rather than a 'processor'. Mechanistically, an LLM with 'low confidence' simply has a flatter probability distribution across its output tokens, which may trigger a programmed sampling protocol to retrieve more context. It does not 'change its mind' or feel 'motivated'. The authors project their own human experience of uncertainty onto the statistical variance of the model.

  • Rhetorical Impact: This intentional framing radically shapes audience perception by granting the AI full autonomy and psychological depth. If an AI is 'motivated' by its confidence, it is perceived as an independent colleague with its own internal drives. This profoundly affects reliability and trust; humans naturally extend empathy and relation-based trust to entities that appear to struggle with decisions or seek more information. It creates severe risk by convincing policymakers that the system is capable of rational self-doubt and strategic caution, which it is not.
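For contrast, here is a minimal sketch of what 'low confidence' amounts to mechanistically: summary statistics of a probability distribution's shape. All values are invented for illustration.

```python
import numpy as np

def describe(probs):
    """Report the shape statistics that get glossed as 'confidence'."""
    probs = np.asarray(probs)
    entropy = -np.sum(probs * np.log(probs))
    return f"max prob = {probs.max():.2f}, entropy = {entropy:.2f} nats"

peaked = [0.90, 0.05, 0.03, 0.02]   # 'high confidence': mass on one token
flat   = [0.30, 0.25, 0.25, 0.20]   # 'low confidence': near-uniform mass

print("peaked:", describe(peaked))  # high max prob, low entropy
print("flat:  ", describe(flat))    # low max prob, high entropy
# A wrapper program might branch on these statistics (e.g., retrieve more
# context when entropy is high); nothing here 'changes its mind'.
```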

Explanation 3

Quote: "Because the model has been instructed to apply a threshold, its confidence estimates have already incorporated the threshold comparison rather than representing the raw belief signal."

  • Explanation Types:

    • Reason-Based: Gives agent's rationale, entails intentionality and justification
    • Empirical Generalization: Subsumes events under timeless statistical regularities
  • Analysis (Why vs. How Slippage): This explanation blends mechanistic observation with a reason-based rationale. The explanation frame is agential (why): the model's outputs look a certain way because it followed instructions and 'incorporated' constraints. This choice emphasizes the model as a compliant, reasoning agent that alters its internal states based on linguistic instructions. It obscures the mechanistic reality that the prompt simply altered the context window, which deterministically shifted the output probabilities. By framing the statistical output as a deliberate 'incorporation' of a rule, the text elevates natural language processing to the level of conscious rule-following.

  • Consciousness Claims Analysis: The passage makes stark epistemic claims, most notably referring to a 'raw belief signal'. It explicitly attributes the capacity for 'belief' to the language model. The text uses a mix of mechanistic verbs ('apply', 'incorporated') and profound consciousness concepts ('belief signal'). The actual mechanistic process is that the presence of the threshold instruction in the prompt changes the attention weights, leading to a different final probability distribution for the output tokens. There is no 'belief' involved; there is only token correlation. The authors suffer from the curse of knowledge, interpreting the statistical shift caused by the prompt as a cognitive act of rule integration.

  • Rhetorical Impact: Referring to an AI's output as a 'raw belief signal' fundamentally alters how the audience perceives the system's reliability. It suggests the model possesses an underlying truth-tracking mechanism—a genuine grasp of reality—that is then moderated by instructions. This leads audiences to trust that the AI has a genuine grasp of the facts. If people believe the AI has 'beliefs' rather than just 'probabilities', they will treat its outputs as testimony rather than generated text, deeply impacting legal and epistemic frameworks surrounding AI liability.

Explanation 4

Quote: "At test time, residual stream activity in the network at a given layer was additively modulated as: r̃(l) = r(l) + αv(l)"

  • Explanation Types:

    • Functional: Explains behavior by role in self-regulating system with feedback
    • Empirical Generalization: Subsumes events under timeless statistical regularities
  • Analysis (Why vs. How Slippage): This is a purely mechanistic (how) explanation. By providing the exact mathematical equation for activation steering, the authors emphasize the physical, computational reality of the system. This framing strips away all agency, intentionality, and psychology, reducing the AI to a mathematical function where inputs are 'additively modulated' to produce outputs. This choice is highly effective for technical clarity, emphasizing the deterministic control the researchers have over the system. It briefly dispels the illusion of the autonomous agent, revealing the matrix of weights beneath.

  • Consciousness Claims Analysis: There are no consciousness verbs or epistemic claims in this passage. The text correctly identifies the system as a processor rather than a knower. The actual mechanistic process—adding a scaled vector to the residual stream activations at a specific layer—is perfectly described. There is no curse of knowledge here; the authors describe the system precisely as it exists in code. This stands in stark contrast to the rest of the paper, highlighting the authors' ability to be technically precise when detailing their own interventions, even as they anthropomorphize the results of those interventions.

  • Rhetorical Impact: This framing establishes profound scientific credibility and authority. By demonstrating they can manipulate the model at the level of linear algebra, the researchers earn the audience's trust in their technical competence. However, rhetorically, this mechanical precision is later leveraged to legitimize the psychological metaphors. Once the audience believes the authors have mathematical mastery over the system, they are more likely to accept the subsequent claims that this math equates to 'metacognitive control' and 'belief'.
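Restated as code, the quoted equation is a single vector addition. The dimensions and values in this sketch are toy assumptions; in a real experiment the addition would be applied inside the forward pass (e.g., via a hook at layer l).

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8                    # toy hidden dimension (assumption)
r = rng.normal(size=d_model)   # r(l): residual stream activation at layer l
v = rng.normal(size=d_model)   # v(l): the steering direction chosen by researchers
v /= np.linalg.norm(v)         # unit-normalize the direction (an assumption here)
alpha = 4.0                    # scaling coefficient chosen by the experimenter

r_steered = r + alpha * v      # the entire 'intervention': one vector addition

# For a unit vector v, the size of the induced shift is exactly |alpha|.
print("shift norm:", np.linalg.norm(r_steered - r))
```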

Explanation 5

Quote: "our results show that models adaptively deploy internal confidence signals to guide behavior—suggesting a dissociation between metacognitive control and verbal introspection."

  • Explanation Types:

    • Intentional: Refers to goals/purposes, presupposes deliberate design
    • Theoretical: Embeds in deductive framework, may invoke unobservable mechanisms
  • Analysis (Why vs. How Slippage): This explanation utilizes an intentional and theoretical framing, leaning heavily into agential (why) concepts. By asserting that models 'adaptively deploy' signals to 'guide behavior', the explanation frames the AI as an intentional, purposeful actor navigating its environment. Furthermore, invoking a 'dissociation between metacognitive control and verbal introspection' builds a deep, unobservable theoretical psychological framework around the software. This emphasizes the model as a complex mind with conscious and subconscious layers, completely obscuring the mechanistic reality of a feed-forward network mapping inputs to outputs.

  • Consciousness Claims Analysis: The passage makes massive epistemic claims, projecting high-level human psychology onto the system. It assesses the system as an intentional 'knower' capable of 'introspection' and 'metacognitive control'. Mechanistically, the 'dissociation' observed is simply that the model's generated text (verbal output) is poorly calibrated, while the raw log probabilities (internal signals) are better calibrated to the accuracy of the training data. This is a common statistical artifact of RLHF and next-token generation. By calling it 'verbal introspection', the authors dramatically over-interpret a statistical mismatch, projecting complex Freudian or cognitive psychological architecture onto a mathematical discrepancy.

  • Rhetorical Impact: This framing has a profound rhetorical impact, solidifying the illusion of the AI as a deeply complex, almost biological mind. By using clinical psychological terms ('dissociation', 'introspection'), the text elevates the machine to the status of a psychological subject. This dramatically inflates perceived capability and risk, leading audiences to view the AI as an entity that must be psychoanalyzed rather than a program that must be debugged. It shifts the paradigm of AI evaluation from software engineering to behavioral psychology.
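A minimal sketch of the kind of statistical mismatch being glossed as 'dissociation': two confidence readouts from the same system, scored against the same correctness labels. The numbers are invented for illustration, and the gap measure is a crude per-item stand-in for proper calibration metrics.

```python
import numpy as np

def calibration_gap(confidences, correct):
    """Mean absolute gap between stated confidence and actual correctness."""
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    return float(np.abs(confidences - correct).mean())

correct = np.array([1, 0, 1, 1, 0, 1])  # made-up right/wrong outcomes

internal = np.array([0.85, 0.30, 0.75, 0.90, 0.40, 0.70])  # from token log probs
verbal   = np.array([0.95, 0.90, 0.95, 0.90, 0.85, 0.95])  # text always says 'sure'

print("internal gap:", round(calibration_gap(internal, correct), 3))  # smaller
print("verbal gap:  ", round(calibration_gap(verbal, correct), 3))    # larger
# The 'dissociation' is this kind of mismatch between two readouts of one
# network, not an introspecting mind at odds with itself.
```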

Task 4: AI Literacy in Practice - Reframing Anthropomorphic Language

About this task

This section proposes alternative language for key anthropomorphic phrases, offering more mechanistic and precise framings that better reflect the actual computational processes involved. Each reframing attempts to strip away the projections of intention, consciousness, or agency that are embedded in the original language.

V3 Enhancement: A fourth column addresses human agency restoration—reframing agentless constructions to name the humans responsible for design and deployment decisions.

Reframing 1

  • Original Anthropomorphic Frame: "LLMs exhibit structured metacognitive control paralleling biological systems"
  • Mechanistic Reframing: The models generate statistical outputs that correlate with accuracy, mimicking the behavioral results of biological self-evaluation without possessing actual awareness.
  • Technical Reality Check: The system processes token probability distributions; it does not possess metacognition or self-awareness. It calculates logits that researchers map to accuracy metrics.
  • Human Agency Restoration: Researchers designed metrics that evaluate model probability distributions against accuracy benchmarks, producing statistical parallels to biological behavior.

Reframing 2

  • Original Anthropomorphic Frame: "autonomous agents that must recognize their own uncertainty and know when to act, seek help, or abstain."
  • Mechanistic Reframing: Automated software systems programmed to trigger secondary functions or output predefined refusal tokens when probability metrics fall below specific thresholds.
  • Technical Reality Check: The model calculates statistical variance; it does not 'recognize' uncertainty or 'know' anything. It processes inputs and generates tokens based on mathematical weights.
  • Human Agency Restoration: Software engineers develop and deploy automated systems, programming them with specific thresholds that dictate when the program should execute secondary tasks or output refusal strings.

Reframing 3

  • Original Anthropomorphic Frame: "LLMs themselves can utilize an internal sense of confidence to guide their own decisions"
  • Mechanistic Reframing: The software architecture uses the probability values of generated tokens to conditionally determine the subsequent outputs of the program.
  • Technical Reality Check: The system extracts logit probabilities; it has no 'internal sense'. It generates the token with the highest predicted value based on its training; it does not 'decide'.
  • Human Agency Restoration: The research team programmed a pipeline where the model's token probabilities are extracted and used to trigger specific experimental outcomes.

Reframing 4

  • Original Anthropomorphic Frame: "the single-trial Phase 1 confidence which reflects GPT4o's subjective certainty given a particular allocation."
  • Mechanistic Reframing: The scaled maximum token probability generated by GPT-4o for a specific prompt configuration.
  • Technical Reality Check: The model produces a mathematical probability score adjusted via temperature scaling; it possesses no 'subjective certainty' or conscious justification.
  • Human Agency Restoration: OpenAI engineers designed the model's architecture, and the researchers applied temperature scaling to the output logits to align them with empirical accuracy.

Reframing 5

  • Original Anthropomorphic Frame: "steering affects both what the model believes about the correctness of the option... and how it uses those beliefs to decide"
  • Mechanistic Reframing: Injecting vectors alters both the hidden state representations of the input and the final probability distribution over the output tokens.
  • Technical Reality Check: The network processes mathematical vectors; it forms no 'beliefs' and comprehends no 'correctness'. The injected vector mathematically shifts the token generation probabilities.
  • Human Agency Restoration: The researchers manipulated the model by manually injecting mathematical vectors into the residual stream, altering the system's output generation.

Reframing 6

  • Original Anthropomorphic Frame: "models adaptively deploy internal confidence signals to guide behavior"
  • Mechanistic Reframing: The system generates outputs that vary based on the statistical probabilities calculated during the forward pass.
  • Technical Reality Check: The frozen model simply processes matrices; it does not 'adaptively deploy' anything or possess intentional strategy. Outputs are strictly the result of computational parameters.
  • Human Agency Restoration: The researchers designed an experimental framework that correlates the model's internal probability metrics with specific prompted outputs.

Reframing 7

  • Original Anthropomorphic Frame: "suggesting a dissociation between metacognitive control and verbal introspection."
  • Mechanistic Reframing: Highlighting a statistical discrepancy between the model's raw output probabilities and the semantic content of the text it generates.
  • Technical Reality Check: The system lacks conscious introspection and metacognition. It merely exhibits a mathematical variance between base probability distributions and the specific text strings favored by its fine-tuning.
  • Human Agency Restoration: Engineers fine-tuned the model to generate specific text styles, which researchers found diverge statistically from the model's base token probabilities.

Reframing 8

  • Original Anthropomorphic Frame: "This conservatism is partially offset by the model's overweighting of its own confidence signals"
  • Mechanistic Reframing: This statistical bias toward the abstain token is partially counteracted by the steep slope of the logistic regression relative to the probability predictor.
  • Technical Reality Check: The system possesses no ethical 'conservatism' or risk-aversion. These are mathematical parameters (intercept and scale) derived from fitting a regression model to the data.
  • Human Agency Restoration: The researchers fitted a logistic regression model to the data, identifying mathematical biases that reflect the safety fine-tuning applied by the model's developers.

Task 5: Critical Observations - Structural Patterns

Agency Slippage

The text exhibits a systematic and highly functional oscillation between mechanical and agential framings. This agency slippage operates bi-directionally: profound psychological agency is attributed TO the AI systems, while the structural agency is removed FROM the human researchers and corporate developers.

The gradient of this slippage follows a distinct structural pattern across the paper. In the Introduction and Discussion sections, the text relies almost exclusively on agential framing. Here, the AI 'reflects,' 'knows,' 'utilizes an internal sense,' and exhibits 'metacognitive control.' However, in the Methods section, the illusion is momentarily suspended to provide technical reproducibility. Suddenly, the AI is reduced to a matrix: researchers use 'greedy decoding,' apply 'temperature scaling' to 'logits,' and execute 'activation steering' by adding scaled vectors to the 'residual stream'.

This creates a dramatic slippage moment when transitioning from Phase 3 Methods to the Results. The text moves abruptly from describing the injection of a vector (r̃(l) = r(l) + αv(l)) to claiming this proves 'what the model believes about the correctness of the option'. This mechanical-to-agential shift is the core mechanism of the illusion. The researchers use their genuine mechanical mastery to legitimize their unwarranted psychological metaphors.

This slippage is deeply rooted in the 'curse of knowledge.' The researchers understand the complex mathematical thresholds they have designed. Because these mechanisms serve the functional purpose of human confidence (determining when to act based on probability), the authors project their own human experience of confidence ONTO the system. When the math behaves similarly to a human hedging a bet, the researchers claim the machine possesses 'subjective certainty.'

The rhetorical accomplishment of this slippage is profound. By establishing mechanical credibility and then slipping into intentional explanation types, the authors make it 'sayable' that a matrix of floating-point numbers has an inner psychological life. Simultaneously, agentless constructions ('the model was instructed,' 'a negative baseline bias') make it 'unsayable' that human engineers at Google DeepMind and OpenAI hardcoded these statistical biases and defined the behavioral thresholds. The slippage manufactures an autonomous mind out of math, while rendering the human creators invisible.

Metaphor-Driven Trust Inflation

The text constructs an architecture of authority and trust entirely upon metaphorical foundations. By consistently framing statistical token prediction through the lens of 'metacognition,' 'confidence,' and 'subjective certainty,' the authors invite the audience to extend deep, relation-based trust to a mathematical artifact.

Crucially, there is a profound difference between performance-based trust (relying on a calculator because it always adds correctly) and relation-based trust (relying on a doctor because they understand the stakes, feel uncertainty, and know when to seek a second opinion). The text systematically encourages relation-based trust toward systems utterly incapable of reciprocating it. By claiming the AI 'knows when to... seek help' and possesses 'subjective certainty,' the discourse signals that the system has an ethical and epistemic interiority. It suggests the machine will act with the same cautious self-preservation and ethical hesitation as a human expert.

This transfer of human trust frameworks onto statistical systems is highly dangerous. The text explicitly mentions the 'medical domain' as a high-stakes scenario where this capability is vital. If clinicians are convinced by this metaphorical framing that an LLM genuinely 'reflects on and assesses the quality of its own cognitive performance,' they will grant it unwarranted medical authority. They will assume that if the AI doesn't 'seek help' or 'abstain,' it must be genuinely, justifiably certain of its diagnosis.

The text manages system limitations by framing them not as software bugs, but as psychological quirks. The AI isn't miscalculating probabilities; it is showing a 'dissociation between metacognitive control and verbal introspection.' This intentional, reason-based explanation type constructs a sense that even when the AI fails, its decisions are the result of complex, almost biological internal processes. The metaphors construct a supreme digital authority, disguising the fragile, pattern-matching reality of the algorithm behind the mask of a deeply self-aware and fundamentally trustworthy agent.

Obscured Mechanics

The intense anthropomorphic and consciousness-attributing language systematically conceals the technical, material, and labor realities that actually produce the observed behaviors. When the text claims that 'models adaptively deploy internal confidence signals' or exhibit 'conservatism', it throws a psychological veil over massive corporate and human engineering efforts.

Applying the 'name the corporation' test reveals severe transparency obstacles. The models discussed—GPT-4o, Gemma 3, Qwen—are products developed by OpenAI, Google DeepMind, and Alibaba. The text repeatedly attributes 'decisions' to these models, hiding the proprietary algorithms, alignment protocols, and corporate directives that actually shape the token distributions. The text confidently asserts what the model 'believes' despite lacking any transparent access to the true training data mixtures or specific RLHF penalty weights of GPT-4o.

Concretely, this framing obscures four key realities. Technically, attributing 'understanding' to the AI hides its total dependency on historical training data correlations; it has no causal models or ground truth, only statistical frequency. The 'confidence' is merely a log probability, completely ignorant of reality. Materially, the framing of a singular 'autonomous agent' erases the massive data centers, energy consumption, and compute required to generate these tokens. Economically, framing the model as a 'metacognitive' entity obscures the business models of the corporations rushing to replace human labor with APIs.
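
In code, the quantity the discourse calls 'confidence' is roughly the following (a sketch, not the authors' implementation):

```python
# The "confidence" at issue is a log probability read off the softmax.
# It is computed from the logit vector alone and never references the world.
import numpy as np

def top_token_log_prob(logits: np.ndarray) -> tuple[int, float]:
    z = logits - logits.max()                 # numerical stability
    log_probs = z - np.log(np.exp(z).sum())   # log-softmax
    top = int(np.argmax(log_probs))
    return top, float(log_probs[top])
```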

Most significantly, this framing obscures the labor of thousands of invisible workers. The 'conservatism' and 'abstention behavior' the authors praise as innate metacognition are actually the direct result of Reinforcement Learning from Human Feedback (RLHF). Underpaid data annotators spent thousands of hours penalizing models for hallucinating and rewarding them for refusing to answer. The AI doesn't 'know its uncertainty'; it has been statistically beaten into compliance by human workers. If we replace the metaphors with mechanistic language, the illusion of the autonomous mind vanishes, and the vast, expensive, and fragile human-corporate infrastructure powering the AI becomes immediately visible.

Context Sensitivity

The distribution and intensity of anthropomorphic language in this paper is highly strategic, revealing a deliberate rhetorical pattern. The consciousness claims are not uniformly distributed; they intensify specifically when the authors are framing the significance of their findings and setting future visions, while receding when precise technical reproduction is required.

In the Introduction and Discussion, the language reaches peak anthropomorphism. Here, 'processing' becomes 'understanding,' and mathematical outputs become 'metacognitive control' and 'subjective certainty.' The text leverages its technical grounding in the Methods section—where rigorous mathematical terms like 'temperature scaling' and 'residual stream' are used—to earn an unwarranted metaphorical license. Because the authors proved they can mathematically manipulate the 'logits' (technical), the reader is invited to trust their assertion that they are manipulating the model's 'beliefs' (metaphorical). The register shifts seamlessly from 'X is mathematically correlated with Y' to 'X physically demonstrates that the machine knows Y.'

There is also a profound asymmetry in how capabilities versus limitations are framed. When the model succeeds at abstaining, it is framed in highly agential, conscious terms: it 'exercises metacognitive control' and 'knows when to seek help'. However, when the models fail—such as when they are 'markedly overconfident and poorly calibrated'—the framing reverts to mechanical or statistical terminology. Success is an act of a conscious agent; failure is a statistical miscalibration.

This asymmetry accomplishes a specific strategic function: it protects the vision of the 'autonomous agent' from falsification. The anthropomorphism serves as a marketing and vision-setting tool, aligning the research with the broader industry narrative of achieving Artificial General Intelligence (AGI). By writing for a dual audience (math for the engineers, 'metacognition' for the press and policymakers), the authors insulate themselves from critique. They wrap statistical correlations in the majestic language of human consciousness, advancing a specific technological trajectory where machines are viewed as minds rather than software.

Accountability Synthesis

Accountability Architecture

This section synthesizes the accountability analyses from Task 1, mapping the text's "accountability architecture"—who is named, who is hidden, and who benefits from obscured agency.

The metaphors, agentless constructions, and consciousness projections in this text synthesize to build a robust architecture of displaced responsibility. By systematically attributing human psychological states and executive decision-making to algorithmic processes, the discourse creates an 'accountability sink' where human corporate and engineering responsibility completely disappears.

The pattern of responsibility distribution is stark. The human actors (researchers) are named only when taking credit for experimental design, while the AI is named as the sole actor responsible for 'decisions,' 'beliefs,' and 'conservatism.' Decisions that were actively made by humans—such as applying specific prompt constraints, fitting logistic regression models to force decision boundaries, and fine-tuning models to refuse answers—are presented as inevitable, emergent cognitive traits of the machine. The passive voice and agentless constructions ('abstention behavior can be influenced', 'a negative baseline bias shifts the decision boundary') strategically shield the designers from their own design choices.

When responsibility is removed from the developers, it transfers entirely to the AI as a supposedly autonomous agent. The liability implications of this framing are profound. If a hospital deploys an LLM that gives a lethal recommendation instead of 'abstaining', this discourse provides the legal and ethical framework to blame the machine. If the AI supposedly 'possesses an internal sense of confidence' and 'knows when to seek help', then its failure to do so is framed as the machine making a bad 'decision' or holding a false 'belief'—not as Google or OpenAI deploying a defective, statistically brittle text generator.

If we apply the 'name the actor' test to the central claims, the reality shifts drastically. Instead of 'GPT-4o treats errors as costlier', we must write 'OpenAI engineers optimized the network to avoid costly errors.' Instead of 'the model uses its beliefs to decide', we must write 'the prompt script outputs a refusal when probabilities drop.' By naming the actors, the 'magic' of the AI disappears, replaced by visible, auditable corporate engineering choices. The institutional interest served by obscuring this agency is clear: it allows tech companies to market their products as brilliant, autonomous minds while completely evading the liability that should accompany the deployment of deterministic, deeply flawed statistical software into public life.
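
The second rewrite can be rendered literally in a few lines. A sketch of a developer-chosen refusal threshold (the token string and the threshold value are hypothetical):

```python
# "The model uses its beliefs to decide," rewritten mechanistically: a
# human-chosen threshold over candidate probabilities triggers a refusal token.
ABSTAIN_TOKEN = "<abstain>"  # hypothetical token string

def respond(candidates: dict[str, float], threshold: float = 0.75) -> str:
    """Emit the most probable answer unless its probability falls below the
    developer-chosen threshold, in which case emit the refusal token."""
    best = max(candidates, key=candidates.get)
    return best if candidates[best] >= threshold else ABSTAIN_TOKEN
```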

Conclusion: What This Analysis Reveals

The Core Finding

This analysis reveals a highly integrated system of anthropomorphism built upon a foundational, load-bearing metaphor: the AI system as a conscious, biological organism capable of 'metacognition'. This core projection enables a cascade of secondary metaphors. Once the model is established as an entity that can 'reflect' upon its own mind, it naturally follows that it can possess 'subjective certainty', form 'beliefs', and act as an 'autonomous agent'. This is not a simple one-to-one mapping of computer terminology to human concepts; it is a complex analogical structure that imports the entire architecture of human epistemic and moral psychology into the realm of linear algebra. The consciousness architecture of the text carefully blurs the line between processing (calculating probability distributions) and knowing (holding justified beliefs). If the foundational premise—that statistical variance equals internal self-awareness—is removed, the entire rhetorical structure of the paper collapses. The claims of 'metacognitive control' and 'conservatism' instantly revert to descriptions of mathematical curve-fitting and human-engineered thresholding. The illusion of the mind is entirely dependent on accepting the initial projection of biological interiority onto proprietary software.

Mechanism of the Illusion:

The text creates the 'illusion of mind' through a sophisticated temporal and causal rhetorical sleight-of-hand, driven largely by the curse of knowledge. The authors begin by identifying a valid mathematical reality: language models output log probabilities that correlate with empirical accuracy. Because this mathematical thresholding serves the same functional purpose as human confidence (dictating when to act), the authors project the human feeling of confidence onto the math. They establish the AI as a 'knower' by replacing statistical verbs ('calculates', 'correlates', 'processes') with consciousness verbs ('reflects', 'knows', 'believes'). The temporal structure of the illusion is critical: the text proves mathematical control in the Methods section, then uses that scientific credibility to validate wild psychological claims in the Discussion. This leverages 'Functional' and 'Empirical' explanation types to legitimize 'Intentional' and 'Reason-Based' narratives. The illusion exploits a deep vulnerability in human psychology: our natural inclination to attribute mind to anything that exhibits complex, responsive behavior. By wrapping statistical predictability in the language of human self-doubt, the text successfully bridges the gap between cold computation and relatable human interiority.
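
That 'valid mathematical reality' is itself checkable with a standard reliability computation: bin answers by stated probability and compare each bin's mean confidence against its empirical accuracy. A sketch, assuming synthetic inputs:

```python
# Standard reliability-diagram computation. The correlation it measures is the
# entire mathematical substrate beneath the paper's "confidence" vocabulary.
import numpy as np

def reliability_bins(confidences, correct, n_bins: int = 10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_idx = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            rows.append((float(confidences[mask].mean()),   # mean stated confidence
                         float(correct[mask].mean())))      # empirical accuracy
    return rows  # well-calibrated output lies near the diagonal
```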

Material Stakes:

Categories: Regulatory/Legal, Institutional, Epistemic

The material consequences of framing language models as conscious 'metacognitive' agents are severe and tangible across multiple domains. In the Regulatory/Legal sphere, attributing 'beliefs' and 'decisions' to an AI system creates a perilous accountability sink. If policymakers accept the premise that models are 'autonomous agents' that 'know their uncertainty', regulations will focus on treating the AI as a quasi-legal subject rather than treating the AI companies as manufacturers of defective software. When a medical LLM hallucinates a lethal dosage, the metaphorical framing shifts liability away from the developers who failed to align the model, blaming instead the 'miscalibrated beliefs' of the machine. Institutionally, this framing invites catastrophic over-reliance in high-stakes environments. If hospital administrators or military commanders believe an AI system naturally 'treats errors as costlier' and 'knows when to seek help', they will systematically dismantle necessary human oversight, trusting the machine to self-regulate based on an entirely fictional ethical interiority. Epistemically, this discourse degrades our collective understanding of truth and computation. By claiming a text-generator possesses 'subjective certainty', we elevate statistical correlation to the level of human knowledge and justified belief. The primary winners in this paradigm are the technology corporations, who benefit from the inflated capabilities and diffused liability. The losers are the institutions and citizens who must navigate the fallout of applying relation-based trust to unthinking, unaccountable statistical artifacts.

AI Literacy as Counter-Practice:

Practicing critical precision fundamentally dismantles the illusion of mind and restores visibility to human agency. As demonstrated in the reframing exercises, when we replace consciousness verbs ('knows', 'understands', 'believes') with mechanistic verbs ('processes', 'predicts', 'generates'), the myth of the autonomous agent evaporates. Replacing 'the model uses its beliefs to decide' with 'the system calculates probabilities to conditionally generate outputs' forces the reader to recognize the absolute absence of awareness and the strict reliance on statistical data. Furthermore, explicitly restoring human agency—changing 'the model is conservative' to 'engineers trained the model to output refusal tokens'—shatters the accountability sink, placing the responsibility for safety and failure squarely back on the corporate developers. Systematic adoption of this literacy requires institutional courage. Academic journals must demand mechanistic translations of anthropomorphic claims, and researchers must commit to resisting the narrative pull of 'AGI' marketing. Unsurprisingly, this precision will face massive resistance from the tech industry. Anthropomorphic language is a multi-billion-dollar marketing asset; it drives venture capital, captures public imagination, and shields companies from product liability. Strict mechanistic literacy threatens these commercial interests by revealing the systems not as emergent digital gods, but as expensive, brittle, and deeply human-dependent software.

Path Forward

Looking toward the future of AI discourse, the vocabulary choices we make will dictate the boundaries of policy, trust, and technological development. If the status quo of intense anthropomorphic language deepens, we risk a future where systems are regulated as minds rather than tools. This approach provides narrative resonance and intuitive accessibility for lay audiences, but embedding assumptions of 'metacognition' and 'belief' into policy will make legal liability intractable and systemic failures inevitable due to unwarranted trust. Conversely, if strict mechanistic precision becomes the norm, transparency is vastly improved. Describing AI solely in terms of 'token prediction', 'weights', and 'optimization' allows regulators to audit systems as software and explicitly hold corporations liable for the specific data and algorithms they deploy. However, this technical vocabulary risks alienating the public, creating an elite knowledge silo where only engineers understand the technology's impacts. A hybrid future might involve regulatory frameworks that mandate dual-discourse disclosures: companies and researchers could use anthropomorphic shorthand for accessibility, but would be legally required to provide a parallel, rigorously mechanistic translation of all capability claims. The choice between these vocabularies is not merely semantic; it is a battle over power. Anthropomorphism serves the interests of those who wish to obscure the human labor, environmental cost, and corporate control behind AI systems. Mechanistic clarity serves the interests of public safety, legal accountability, and epistemic integrity. The future of AI governance will be decided by which linguistic framework ultimately shapes public understanding.


Extended Processing Summary

Gemini's Processing Steps
What you are about to read…

Gemini refers to the text below as "thought summaries." This is an overt consciousness projection, because 'thoughts' are hallmarks of a conscious mind that 'knows' what it is doing and why. The concealed mechanistic process is probabilistic text generation. Treat this as just another rhetorical artifact: a way of making the model's processing legible. When the model writes "I should double check that...", it is mathematically altering its probability distribution for the next token so that subsequent output is more likely to be accurate. The "I" is a functional gear in the machine, not a self-reference. This language is a deliberate interface design choice, not a biological reality. The developers could have programmed the model to output passive logs. Instead, they chose to simulate a narrator.

Initiating Discourse Analysis

I'm starting the deep dive into "Causal Evidence that Language Models use Confidence to Drive Behavior" with a critical discourse analysis (CDA) lens, specifically focusing on the methodologies outlined. My immediate focus is on the document's structure and framing assumptions, mapping out the initial arguments.

Developing the Metaphor Framework

I'm now building out a JSON schema to structure my analysis. The schema focuses on identifying and categorizing AI metaphors, especially concerning consciousness and agency. This involves mapping sources and targets of metaphors, analyzing explanations, and assessing their rhetorical impact. My goal is a comprehensive audit focusing on agency slippage and metaphor-driven trust. I'm focusing on "metaphorAudit", "sourceTargetMapping", "explanationAudit", "reframedLanguage", and "criticalObservations".
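
For readers curious what such a schema might look like, the following is a hypothetical skeleton inferred only from the key names in this log; the field shapes are assumptions, not the published schema:

```python
# Hypothetical skeleton inferred from the keys named in the log above.
# Field shapes are assumptions; the actual schema is not reproduced here.
analysis_schema = {
    "metaphorAudit": [
        {"quote": "", "frame": "", "projection": "",
         "acknowledgment": "", "implications": ""},
    ],
    "sourceTargetMapping": [
        {"sourceDomain": "", "targetDomain": "", "quote": ""},
    ],
    "explanationAudit": [
        {"quote": "", "explanationType": "", "mechanisticReading": ""},
    ],
    "reframedLanguage": [
        {"original": "", "reframed": "", "agencyRestored": ""},
    ],
    "criticalObservations": {
        "agencySlippage": "",
        "trustInflation": "",
        "obscuredMechanics": "",
    },
}
```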

Refining Analytical Scope

I'm now zeroing in on the information extraction phase. I've broken down the text and am outlining the structure: introduction, methods, experiments, results, and conclusion. I am going to delve deeper into the paper's anthropomorphic framing, especially how it establishes an illusion of LLM self-awareness. I'm focusing on "metacognitive control," "internal estimate of decision correctness," and how those relate to agency claims.

Analyzing Metacognitive Claims

I'm currently dissecting specific claims related to "metacognitive control" and "internal evaluations." I'm identifying key phrases suggesting LLMs possess subjective certainty and make meta-decisions. I'm juxtaposing these with the underlying mechanistic realities, such as logits, temperature scaling, and logistic regression. My aim is to highlight how language constructs agency and obfuscates the underlying computational processes.

Deconstructing Confidence Signals

I'm now deeply engaged with the paper's confidence-related claims. I'm focusing on key phrases like "subjective certainty," "meta-decisions," and "internal sense of confidence." These concepts are used to convey agency. I'm scrutinizing their relationship to actual model behaviors, particularly in the context of allocation and adaptive signal deployment. I'm specifically examining how confidence thresholds influence decision-making and how verbal reports correlate (or fail to correlate) with internal processes.

Mapping Metaphor Sources

I'm now mapping source domains (like biological metacognition) to target domains (e.g., logit distributions) based on specific quotes. I'm focusing on "activation steering," which directly influences confidence, and examining how language describes this. I am carefully cataloging the functional, theoretical, dispositional, and intentional explanations to understand how they conceal the underlying computational processes. I'm ensuring the mapping captures how the model's language incorporates and obfuscates the underlying computational processes.

Reframing Explanations

I'm presently reframing key phrases from the paper to remove anthropomorphic language and restore clarity. I'm focusing on those phrases in the last few task areas, aiming for a more precise depiction of underlying mechanisms. My current rewrite of "LLMs can utilize an internal sense of confidence to guide their own decisions" to "The models process token probabilities and generate output logits that correlate with calibrated accuracy patterns" is a good example. I'm carefully calibrating the tone to avoid agency inflation.

Critiquing Agential Framing

I'm starting the next stage. I'm drafting the "Critical Observations" section, analyzing agency slippage across the paper's introduction, methods, and discussion. I'm focusing on the contrast between agential terms ("know," "reflect") and the mechanical reality of logits and softmax calculations. The analysis also covers how researchers' framing can lead to over-trust and reliance, especially in high-stakes fields.

Synthesizing Accountability Implications

My focus is shifting towards the ethical ramifications presented. I'm currently assessing accountability in relation to the AI's perceived agency, especially where results are incorrect. I'm analyzing how agential framing displaces responsibility from developers to the model itself. The abstract and discussion sections exhibit the highest anthropomorphism, in contrast to the more precise methods section. I'm focusing on the accountability asymmetry: capabilities framed as "metacognitive control," limitations as "miscalibration".

Assessing Ethical Ramifications

Now, I'm examining material stakes, especially in areas like healthcare. The risks are substantial if we frame AI as possessing agency versus simply performing statistical token generation. Accountability is a key concern: assigning agency could shift liability. I'm exploring counter-practices, too, translating terms like "confidence" to "logit probability" to demystify language. My analysis considers the analytical mapping: while mechanistic terms provide clarity, they might lose intuitive resonance.

Projecting Potential Risks

I'm now projecting potential risks and practical challenges. I am focusing on the implications of anthropomorphic framing, especially in high-stakes fields like medicine and law, where LLMs are increasingly utilized. I'm focusing on the accountability asymmetry and examining how language constructs agency and obfuscates the underlying computational processes. My goal is to outline accountability analysis in my summary.


Analysis Provenance

Run ID: 2026-03-27-causal-evidence-that-language-models-use-metaphor-yyl5vi
Raw JSON: 2026-03-27-causal-evidence-that-language-models-use-metaphor-yyl5vi.json
Framework: Metaphor Analysis v6.4
Schema Version: 3.0
Generated: 2026-03-27T08:08:17.921Z

Discourse Depot © 2025 by TD is licensed under CC BY-NC-SA 4.0