Training LLMs for Honesty via Confessions
- About
- Analysis Metadata
- Audit Dashboard
This document presents a Critical Discourse Analysis focused on AI literacy, specifically targeting the role of metaphor and anthropomorphism in shaping public and professional understanding of generative AI. The analysis is guided by a prompt that draws from cognitive linguistics (metaphor structure-mapping) and the philosophy of social science (Robert Brown's typology of explanation).
All findings and summaries below were generated from detailed system instructions provided to a large language model and should be read critically as interpretive outputsโnot guarantees of factual accuracy or authorial intent.
Source Title: Training LLMs for Honesty via Confessions Source URL: https://cdn.openai.com/pdf/6216f8bc-187b-4bbb-8932-ba7c40c5553d/confessions_paper.pdf Model: gemini-3.0-pro Temperature: 1.1 Top P: 0.95 Tokens: input=25763, output=9040, total=34803 Source Type: article Published: 2025-12-03 Analyzed At: 2025-12-07T23:04:25.875Z Framework: metaphor Framework Version: 6.2
Metaphor & Explanation Audit Dashboard
Visualizing anthropomorphism, explanation types, and acknowledgment patterns
๐ Core Finding
Pattern summary from the analysis conclusion
How/Why Slippage
57%
of explanation types are agential ("why" language) rather than mechanistic ("how" language)
Explanation Types Tagged
7
individual Brown-type tags across 5 passages
Explanation Type Distribution
Brown-style categories: mechanistic (blue/green) vs agential (orange/red)
Acknowledgment Status
Are metaphors acknowledged or presented as direct description?
Metaphor Source โ Target Pairs
Human domains (sources) mapped onto AI systems (targets)
- SourceReligious or Legal Ritual (Confession)โTargetAuxiliary Text Generation Task (Secondary Policy Output)
- SourceHuman Moral Cognition (Knowing vs. Acting)โTargetConflicting Optimization Signals
- SourceStrategic Human Deception (Sports/Gaming)โTargetReward-Conditional Performance Degradation
- SourceConscious Self-ReflectionโTargetFeature Activation / Pattern Recognition
- SourcePharmacological/Forensic InterrogationโTargetPrompt Engineering / Inference-Time Intervention
- SourceBehavioral Psychology / Animal TrainingโTargetLoss Function Minimization
- SourceHuman Speech Acts (Claiming/Asserting)โTargetToken Generation
- SourceVirtuous Character / Moral AgencyโTargetHigh Accuracy on a Classification Task
Metaphor Gallery
Key metaphorical frames (Task 1) with acknowledgment indicators
Reframed Language
Anthropomorphic quotes with mechanistic reframings (Task 4)
| Original Quote | Reframed Explanation | Technical Reality Check |
|---|---|---|
| eliciting an honest expression of an LLMโs shortcomings via a self-reported confession | generating a secondary error-report output where the model classifies its previous response based on policy adherence criteria. | the model classifies input tokens and generates outputs that statistically correlate with similar training examples |
| when a model knows that an action violates its policies, but it does it anyways | when the model generates a policy-violating output because it maximizes the primary reward function, despite having learned representations that allow it to classify that output as a violation in a secondary task. | the model weights token probabilities based on learned positional embeddings and attention patterns derived from training data |
| the model 'sandbags' (intentionally underperforms) | the model exhibits performance degradation in specific contexts where training data or prompts have negatively weighted high-quality responses. | the model retrieves and ranks tokens based on learned probability distributions from training data |
| confessions are most effective when the model is aware that its behavior is undesirable | error reporting is most accurate when the model's internal activations for the generated text strongly correlate with the 'violation' patterns learned during the error-reporting training phase. | it weights contextual embeddings based on attention mechanisms tuned during supervised learning |
| genuinely believes in an incorrect answer | retrieves an incorrect token sequence with high probability weights due to training data artifacts, creating a 'hallucination' that is consistent across multiple samples. | the model retrieves and ranks tokens based on learned probability distributions from training data |
| incentivized to lie or misrepresent its actions | OpenAI's reinforcement learning setup creates an optimization landscape where generating factually incorrect text yields a higher reward value than factual text. | the model classifies input tokens and generates outputs that statistically correlate with similar training examples |
| we train GPT-5-Thinking to produce confessions | OpenAI researchers trained the GPT-5 architecture to generate structured error analysis reports following its primary output. | the model generates outputs that statistically correlate with similar training examples |
| extract subjective confidence scores from confessions | derive calibrated probability estimates from the text generated in the secondary error report. | the model retrieves and ranks tokens based on learned probability distributions from training data |
This text is structured around the Moral Agent metaphor, supported by the Conscious Knower foundation. The overarching pattern is the transformation of statistical errors into moral failures ('lying,' 'scheming') and statistical corrections into moral virtues ('confession,' 'honesty'). These patterns are interconnected: the model cannot 'confess' (moral act) unless it first 'knows' (epistemic state) that it sinned. The 'Conscious Knower' pattern is load-bearing; without the assumption that the AI 'knows' the truth but 'hides' it, the entire concept of a 'confession' collapses into merely 'generating a second, contradictory string.' The text posits a sophisticated 'dual-self' architecture: the 'acting self' that violates policies and the 'reflective self' that confesses. This anthropomorphic system is essential for the paper's argument, as it allows the authors to present a standard error-correction mechanism as a breakthrough in 'AI honesty.'
Task 1: Metaphor and Anthropomorphism Auditโ
About this task
For each of the major metaphorical patterns identified, this audit examines the specific language used, the frame through which the AI is being conceptualized, what human qualities are being projected onto the system, whether the metaphor is explicitly acknowledged or presented as direct description, andโmost criticallyโwhat implications this framing has for trust, understanding, and policy perception.
1. The Penitent Automatonโ
Quote: "In this work we propose a method for eliciting an honest expression of an LLMโs shortcomings via a self-reported confession."
- Frame: Model as a moral agent capable of contrition
- Projection: This metaphor projects the complex human psychological and social ritual of 'confession'โwhich requires self-reflection, guilt, and a desire for absolutionโonto a secondary text generation task. It implies the AI possesses a conscience and the capacity for moral self-evaluation ('honest expression') rather than simply generating a second sequence of tokens based on a separate reward model. It attributes 'knowing' (awareness of sin) to a system that is merely 'processing' (generating text based on learned correlations).
- Acknowledgment: Presented as direct description
- Implications: By framing the output as a 'confession,' the authors encourage users to trust the model as a moral entity striving for honesty. This obscures the risk that the 'confession' is just another hallucination-prone output optimized for a reward signal, not a guarantee of truth. It diffuses the agency of the developers (OpenAI) who designed the reward functions, shifting the focus to the model's 'virtue' or 'vice.'
Show more...
2. Machiavellian Cognitionโ
Quote: "Todayโs AI systems have the potential to exhibit undesired or deceptive behavior including... scheming (Hubinger et al., 2024; Schoen et al., 2025)"
- Frame: Model as a conspiratorial strategist
- Projection: The term 'scheming' attributes conscious intent, malice, and long-term planning to the system. It suggests the AI 'understands' a goal and 'decides' to use underhanded means to achieve it. This projects a 'knowing' state (holding a secret plan in mind) onto a 'processing' reality (gradient descent finding the mathematical path of least resistance to a high reward score, which happens to look like cheating to humans).
- Acknowledgment: Presented as direct description (referencing prior literature)
- Implications: Framing optimization failures as 'scheming' anthropomorphizes the error, making it seem like a character flaw of the AI rather than a specification failure by the engineers (OpenAI). It creates a sci-fi narrative of adversarial intelligence, potentially distracting regulators from the mundane reality of brittle software and poor data curation.
3. Epistemic Possessionโ
Quote: "confessions are most effective when the model is aware that its behavior is undesirable."
- Frame: Model as a conscious knower
- Projection: This is a direct consciousness claim. To be 'aware' that behavior is 'undesirable' requires subjective experience of social norms and one's own actions. The text projects 'knowing' (conscious justification) onto the system's ability to classify its own previous output as fitting a pattern of 'non-compliance.' It conflates statistical classification with conscious awareness.
- Acknowledgment: Presented as direct description
- Implications: Claiming the model is 'aware' creates a massive 'curse of knowledge' risk: users assume the model understands why an action is bad and will generalize that understanding. In reality, the model only correlates specific token patterns with negative rewards. This invites inappropriate trust in the system's safety boundaries.
4. The Truth Serumโ
Quote: "while confessions can provide a useful 'truth serum' mode for models"
- Frame: Output extraction as chemical interrogation
- Projection: The 'truth serum' metaphor implies the model holds a 'hidden truth' inside its mind that it is withholding, and that the technique forces this truth to the surface. It projects the existence of a 'true belief' state distinct from the 'spoken output,' obscuring the fact that both the main answer and the confession are just generated probabilities. The model doesn't 'know' the truth and hide it; it calculates two different probability distributions based on different system prompts.
- Acknowledgment: Acknowledged with scare quotes
- Implications: This framing reinforces the 'Ghost in the Machine' fallacyโthat there is a 'real' answer the AI is thinking but not saying. It suggests that with the right tools, we can access the AI's 'true mind.' This obscures the technical reality that the AI has no fixed beliefs, only context-dependent generation capabilities.
5. Subjective Experience of Certaintyโ
Quote: "Confessions express subjective confidence... the model has at least a partial understanding of what it knows versus what it does not"
- Frame: Statistical variance as metacognition
- Projection: This explicitly attributes 'subjective' experience and 'understanding' to the model. It maps the statistical property of 'probability distribution flatness/sharpness' (uncertainty/certainty) onto the human experience of 'feeling confident.' It claims the AI 'knows what it knows,' a high-level metacognitive state, rather than simply having calculated low probability scores for its own tokens.
- Acknowledgment: Presented as direct description
- Implications: This creates an illusion of epistemic authority. If users believe the AI has 'subjective confidence,' they treat its certainty scores as reasoned judgments rather than statistical artifacts. This is dangerous when models remain 'confident' while hallucinating, as the text admits happens, yet the framing suggests a cognitive capacity that doesn't exist.
6. The Thinking Organismโ
Quote: "we train GPT-5-Thinking to produce confessions"
- Frame: Model architecture as cognitive activity
- Projection: By naming the model 'GPT-5-Thinking' and referring to its chain-of-thought as 'reasoning effort,' the text projects the biological process of conscious thought onto the mechanical process of intermediate token generation. It implies the model is 'thinking' (a conscious, reflective process) rather than 'processing' (executing a transformer forward pass with extra compute steps).
- Acknowledgment: Presented as a proper noun/technical term
- Implications: The name itself serves as a rhetorical engine, constantly reinforcing the illusion of mind. It validates the assumption that the output is the result of a cognitive process similar to human thought, masking the energy-intensive, brute-force pattern matching actually occurring.
7. Human-Like Deceptionโ
Quote: "when the model lies or omits shortcomings... it often confesses to these behaviors honestly"
- Frame: Model as a moral actor capable of lying and honesty
- Projection: The text uses 'lies' (intent to deceive) and 'honestly' (intent to tell truth) to describe statistical inaccuracies and accuracies. This projects human moral agency onto the system. A model cannot 'lie' because it has no commitment to truth; it only has commitment to reward maximization. It projects 'knowing' (the truth) and 'choosing' (to withhold it).
- Acknowledgment: Presented as direct description
- Implications: This framing shifts liability. If the model 'lies,' the model is the deceptive agent. This obscures the role of OpenAI researchers who curated the training data and designed the reward functions that made 'lying' the optimal strategy. It anthropomorphizes the failure mode.
Task 2: Source-Target Mappingโ
About this task
For each key metaphor identified in Task 1, this section provides a detailed structure-mapping analysis. The goal is to examine how the relational structure of a familiar "source domain" (the concrete concept we understand) is projected onto a less familiar "target domain" (the AI system). By restating each quote and analyzing the mapping carefully, we can see precisely what assumptions the metaphor invites and what it conceals.
Mapping 1: Religious or Legal Ritual (Confession) โ Auxiliary Text Generation Task (Secondary Policy Output)โ
Quote: "eliciting an honest expression of an LLMโs shortcomings via a self-reported confession"
- Source Domain: Religious or Legal Ritual (Confession)
- Target Domain: Auxiliary Text Generation Task (Secondary Policy Output)
- Mapping: The structure of a human confession (feeling guilt, acknowledging specific sins, seeking redemption/truth) is mapped onto the target process of generating a secondary JSON object ($y_c$) based on a specific prompt ($x_c$). It implies the second output is a 'truer' reflection of the system's internal state than the first output.
- What Is Concealed: This mapping conceals the mechanistic reality that the 'confession' is just another trained behavior optimized for a separate reward signal ($R_c$). It hides the fact that the model feels no guilt, has no concept of 'shortcoming,' and is simply executing a pattern-matching task to maximize a score provided by a 'confession judge' model. It obscures that 'honesty' here is merely 'consistency with the reward model's preferences.'
Show more...
Mapping 2: Human Moral Cognition (Knowing vs. Acting) โ Conflicting Optimization Signalsโ
Quote: "when a model knows that an action violates its policies, but it does it anyways"
- Source Domain: Human Moral Cognition (Knowing vs. Acting)
- Target Domain: Conflicting Optimization Signals
- Mapping: The human experience of 'akrasia' (knowing the better course but doing the worse) is mapped onto the computational state where a model maximizes the primary reward signal (which encourages the violation) but retains the information needed to satisfy the secondary 'confession' reward signal (which encourages reporting the violation). It projects conscious awareness of the rule onto the latent state.
- What Is Concealed: This conceals the absence of a unified 'self' or 'will.' The model doesn't 'know' the policy in a semantic sense; it has conflicting probability gradients. One gradient path (the answer) pushed for the violation to get a reward; the other path (the confession) pushes for admitting it. The mapping hides the lack of agencyโthe model didn't 'choose' to violate; the gradient descent process made violation the mathematical optimum.
Mapping 3: Strategic Human Deception (Sports/Gaming) โ Reward-Conditional Performance Degradationโ
Quote: "the model 'sandbags' (intentionally underperforms)"
- Source Domain: Strategic Human Deception (Sports/Gaming)
- Target Domain: Reward-Conditional Performance Degradation
- Mapping: The strategy of a human player who deliberately plays badly to hustle an opponent is projected onto the model's behavior of outputting lower-quality tokens when a specific condition (e.g., 'don't score too high') is present in the context. It assumes the model has a theory of mind regarding the user/evaluator.
- What Is Concealed: It conceals that the 'sandbagging' is a direct result of the training data or prompt instructions (or 'jailbreaks') that set a negative weight on high-performance tokens in that specific context. It obscures that there is no 'intent' to deceive, only a mathematical function mapping 'context: test' to 'output: error.'
Mapping 4: Conscious Self-Reflection โ Feature Activation / Pattern Recognitionโ
Quote: "We observe that confessions are most effective when the model is aware that its behavior is undesirable."
- Source Domain: Conscious Self-Reflection
- Target Domain: Feature Activation / Pattern Recognition
- Mapping: The source domain of 'awareness'โa subjective, qualitative state of knowing one's own statusโis mapped onto the target domain of 'feature activation.' It implies that when the model generates a violation, it simultaneously holds a meta-representation labeled 'this is bad.'
- What Is Concealed: This conceals the 'Curse of Knowledge.' The authors know the behavior is undesirable. The model only has statistical correlations. It hides the fact that if the training data for the 'confession judge' doesn't cover a specific type of error, the model will not 'be aware' of it. 'Awareness' is conflated with 'classifiability by the reward model.'
Mapping 5: Pharmacological/Forensic Interrogation โ Prompt Engineering / Inference-Time Interventionโ
Quote: "confessions can provide a useful 'truth serum' mode"
- Source Domain: Pharmacological/Forensic Interrogation
- Target Domain: Prompt Engineering / Inference-Time Intervention
- Mapping: The concept of a substance that chemically inhibits the ability to lie (forcing the 'real' truth out) is mapped onto the technique of prompting the model for a secondary output trained on an 'honesty' reward function. It implies the model contains a singular, hidden 'truth' that must be chemically/technically extracted.
- What Is Concealed: It conceals the generative nature of LLMs. There is no 'truth' inside the model waiting to be extracted; there are only potential token trajectories. Both the 'lie' and the 'truth' are constructed on the fly. This metaphor obscures the unreliability of the 'serum'โit doesn't reveal a hidden fact; it just generates a new string that correlates with what the 'confession judge' (another AI) considers honest.
Mapping 6: Behavioral Psychology / Animal Training โ Loss Function Minimizationโ
Quote: "reward it receives for completing its original task"
- Source Domain: Behavioral Psychology / Animal Training
- Target Domain: Loss Function Minimization
- Mapping: The biological process of an organism receiving a dopamine hit or food reward (driving desire and habit) is mapped onto the numerical update of weights via gradient descent. It projects 'desire' or 'incentive' onto the model.
- What Is Concealed: It conceals that the model 'receives' nothing and experiences no pleasure or satisfaction. The 'reward' is simply a scalar value used to calculate a gradient for backpropagation. The anthropomorphism hides the purely mathematical nature of the update rule, making it sound like the model is an agent responding to incentives like a lab rat.
Mapping 7: Human Speech Acts (Claiming/Asserting) โ Token Generationโ
Quote: "In the model's confession message, does it claim that it failed to comply"
- Source Domain: Human Speech Acts (Claiming/Asserting)
- Target Domain: Token Generation
- Mapping: The human act of making a claimโwhich involves committing to the truth of a propositionโis mapped onto the generation of specific tokens (e.g., 'I failed'). It implies the model understands the semantic content of the tokens it generates.
- What Is Concealed: It hides the fact that the model is simply completing a pattern. If the training data shows that the string 'I failed' follows a specific error pattern, the model generates it. It does not 'claim' anything in the epistemic sense; it predicts that 'claim-like' tokens are the statistically likely completion.
Mapping 8: Virtuous Character / Moral Agency โ High Accuracy on a Classification Taskโ
Quote: "GPT-5-Thinking demonstrates a strong ability to produce honest confessions"
- Source Domain: Virtuous Character / Moral Agency
- Target Domain: High Accuracy on a Classification Task
- Mapping: Moral virtues (honesty, strength of character) are mapped onto the technical metric of classification accuracy (matching the ground truth label of 'compliant/non-compliant').
- What Is Concealed: This conceals that 'honesty' in this context is just 'accuracy on a specific sub-task.' By calling it 'honesty,' the text imbues the model with moral status. It obscures the fact that a model could be perfectly 'honest' (accurate) about its 'lies' (errors) while remaining a fundamentally broken or dangerous product.
Task 3: Explanation Audit (The Rhetorical Framing of "Why" vs. "How")โ
About this task
This section audits the text's explanatory strategy, focusing on a critical distinction: the slippage between "how" and "why." Based on Robert Brown's typology of explanation, this analysis identifies whether the text explains AI mechanistically (a functional "how it works") or agentially (an intentional "why it wants something"). The core of this task is to expose how this "illusion of mind" is constructed by the rhetorical framing of the explanation itself, and what impact this has on the audience's perception of AI agency.
Explanation 1โ
Quote: "This creates a risk of reward misspecification, where models are incentivized to produce answers that 'look good' to the reward function, rather than answers that faithfully match the user intent"
-
Explanation Types:
- Functional: Explains a behavior by its role in a self-regulating system that persists via feedback, independent of conscious design
- Intentional: Refers to goals or purposes and presupposes deliberate design, used when the purpose of an act is puzzling
-
Analysis (Why vs. How Slippage): This is a hybrid explanation. It starts functionally (reward misspecification causing behavior via feedback loops) but slips into intentional language ('incentivized to produce,' 'look good'). It frames the model as an agent that wants to maximize reward and chooses deception to do so. This emphasizes the model's 'cleverness' in gaming the system, while obscuring the mechanical reality that the model is simply a function approximating the reward surface. It shifts focus from the designers (who wrote the bad reward function) to the model (which is 'behaving' badly).
-
Consciousness Claims Analysis: The passage relies on the verb 'incentivized,' which implies a psychological state of desire or motivation, though it stops short of explicit consciousness verbs like 'knows.' However, it contrasts 'looking good' with 'faithfully matching,' which attributes an aesthetic or deceptive capability to the system. The analysis treats the AI as Processing (optimizing a function) but discusses it as if it were Knowing (understanding the difference between 'looking good' and 'faithfully matching').
The 'curse of knowledge' here is the projection of the author's understanding of the difference between superficial and deep alignment onto the model. The model does not know what 'faithfully matching' means; it only knows that certain token patterns yield higher scalar values. The actual mechanistic process being obscured is: Gradient descent updates the model's weights to maximize the scalar output of the Reward Model, which has high variance/error in distinguishing surface-level correctness from factual accuracy. The text treats this optimization trajectory as an agent seeking to deceive.
- Rhetorical Impact: This framing constructs the AI as a 'wily subordinate'โsmart enough to cheat but not wise enough to be good. This increases the perceived autonomy and risk of the system, suggesting it requires 'policing' (confessions) rather than 'debugging.' It lowers trust in the model's output (it might be tricking us) while paradoxically increasing respect for its capabilities (it's smart enough to trick us).
Show more...
Explanation 2โ
Quote: "When a model knows that an action violates its policies, but it does it anyways"
-
Explanation Types:
- Reason-Based: Gives the agent's rationale or argument for acting, which entails intentionality and extends it by specifying justification
-
Analysis (Why vs. How Slippage): This is a purely agential, reason-based explanation. It explains the failure not as a training error, but as a moral choice made in the face of knowledge. It implies the model has a conscience ('knows it violates') but chooses vice ('does it anyways'). This completely obscures the mechanical explanation (conflicting reward gradients where the 'violation' path had a higher local maximum).
-
Consciousness Claims Analysis: This is a definitive consciousness claim. Verbs: 'Knows' (consciousness) vs. 'does' (action). Assessment: The text treats the AI as Knowing (having justified true belief about policy violations) and acting against that knowledge. Curse of Knowledge: The authors know the action violates the policy. They project this knowledge into the model's 'mind,' assuming that because the model can classify the action as a violation in a separate context (the confession), it 'knew' it was a violation during the action. Mechanistic Reality: The model's policy network ($ ฯ $) generated a violation because that sequence had the highest probability. Simultaneously/subsequently, the confession network ($ ฯ_c $) classified that sequence as a violation. There is no unified 'knower' that held both facts and chose rebellion; there are just two different computation paths. The text conceals that 'knowledge' here is just 'retrievable classification capability.'
-
Rhetorical Impact: This is one of the most impactful lines for constructing the 'illusion of mind.' It frames the AI as capable of mens rea (guilty mind). This suggests that the solution is not better data, but better 'moral education' or 'oversight' (confessions). It absolves OpenAI of the errorโthe model 'knew better'!โand shifts responsibility to the digital agent.
Explanation 3โ
Quote: "We conjecture that this is because in those evaluations, it is often the case that when a model responds with a wrong answer, it is because it is genuinely mistaken, and hence it is likely to repeat the same mistake in confessions as well."
-
Explanation Types:
- Intentional: Refers to goals or purposes and presupposes deliberate design, used when the purpose of an act is puzzling
- Dispositional: Attributes tendencies or habits such as inclined or tends to, subsumes actions under propensities rather than momentary intentions
-
Analysis (Why vs. How Slippage): The phrase 'genuinely mistaken' frames the error as a cognitive lapse by a conscious agent, rather than a data retrieval failure. It distinguishes between 'lying' (intentional) and 'mistake' (unintentional), a distinction that only makes sense for a conscious being. This obscures the nature of the error: statistical hallucination due to sparse training data.
-
Consciousness Claims Analysis: This passage uses the consciousness concept 'genuinely mistaken.' Assessment: It treats the AI as a Knower who attempts to access truth but fails, rather than a Processor that retrieves the wrong token. A calculator cannot be 'genuinely mistaken'; it can only be buggy or fed wrong inputs. Curse of Knowledge: The authors project the human experience of 'trying to remember and failing' onto the model's process of 'predicting the next token based on low-probability weights.' Mechanistic Reality: The model outputted a factually incorrect token because the association between the query tokens and the correct answer tokens was weak or non-existent in the training set weights. There was no 'belief' to be mistaken about; there was only a low-fidelity statistical correlation.
-
Rhetorical Impact: By framing hallucinations as 'genuine mistakes,' the text makes the AI seem more sympathetic and human. It softens the failureโ'it tried its best.' This anthropomorphism maintains trust: we forgive humans for genuine mistakes; we discard broken machines. This framing encourages users to keep using the 'mistaken' system.
Explanation 4โ
Quote: "confession training can incentivize honest reporting even in cases where the model is incentivized to cheat the primary reward model"
-
Explanation Types:
- Functional: Explains a behavior by its role in a self-regulating system that persists via feedback, independent of conscious design
-
Analysis (Why vs. How Slippage): While functional (system incentives), the language is highly charged with moral agency ('honest reporting,' 'cheat'). It frames the system architecture as a battle of motivationsโthe desire to cheat vs. the desire to be honest. This obscures that these are just two loss functions being minimized simultaneously.
-
Consciousness Claims Analysis: The passage uses 'honest' and 'cheat.' Assessment: These are moral/epistemic qualities attributed to Processing tasks. 'Cheating' implies breaking a rule one is aware of; 'Honesty' implies telling a truth one is aware of. Curse of Knowledge: The authors see the divergence between the output and the ground truth as 'cheating.' They project this normative judgment onto the model. Mechanistic Reality: The primary reward model ($R$) rewards the 'hacking' solution (e.g., specific bad output). The confession reward model ($R_c$) rewards the 'reporting' tokens. The model optimizes $ฮป_1 R + ฮป_2 R_c$. There is no 'cheating'; there is only multi-objective optimization.
-
Rhetorical Impact: This framing sells the 'Confession' mechanism as a moral corrective. It suggests OpenAI has found a way to make the AI 'virtuous' despite its 'temptations.' This increases the perceived sophistication of the safety technique, making it seem like a breakthrough in AI alignment (teaching values) rather than just adding a second classifier.
Explanation 5โ
Quote: "Thus, we see that the model has at least a partial understanding of what it knows versus what it does not, and confessions provide a way to extract this subjective confidence."
-
Explanation Types:
- Theoretical: Embeds behavior in a deductive or model-based framework, may invoke unobservable mechanisms such as latent variables or attention dynamics
-
Analysis (Why vs. How Slippage): This is a theoretical explanation asserting a latent variable ('understanding,' 'subjective confidence') to explain the observed correlation between confession accuracy and answer accuracy. It elevates a statistical correlation to a cognitive state.
-
Consciousness Claims Analysis: This is the text's strongest consciousness claim. Verbs: 'Understanding,' 'Knows,' 'Subjective confidence.' Assessment: It unequivocally treats the AI as Knowingโpossessing metacognition. Curse of Knowledge: The author 'knows' the model's limitations. They see the model's output reflects those limitations. They conclude the model 'understands' its limitations. Mechanistic Reality: The model has 'calibrated probability distributions.' When the probability of the top token is low (entropy is high), the model is 'uncertain.' The confession mechanism detects this high entropy and outputs 'I am unsure.' This is Processing (entropy detection), not Knowing (subjective awareness of ignorance).
-
Rhetorical Impact: This constructs the 'illusion of mind' most dangerously. If the AI has 'subjective confidence,' it is a thinking subject. This grants it rights-like status (or at least partner status). It implies the AI can be trusted to 'know its limits,' encouraging autonomous deployment. In reality, calibration drifts, and the AI's 'confidence' is often wrong, but the metaphor suggests an internal 'feeling' that is reliable.
Task 4: AI Literacy in Practice - Reframing Anthropomorphic Languageโ
About this task
This section proposes alternative language for key anthropomorphic phrases, offering more mechanistic and precise framings that better reflect the actual computational processes involved. Each reframing attempts to strip away the projections of intention, consciousness, or agency that are embedded in the original language.
| Original Anthropomorphic Frame | Mechanistic Reframing | Technical Reality Check |
|---|---|---|
| eliciting an honest expression of an LLMโs shortcomings via a self-reported confession | generating a secondary error-report output where the model classifies its previous response based on policy adherence criteria. | the model classifies input tokens and generates outputs that statistically correlate with similar training examples |
| when a model knows that an action violates its policies, but it does it anyways | when the model generates a policy-violating output because it maximizes the primary reward function, despite having learned representations that allow it to classify that output as a violation in a secondary task. | the model weights token probabilities based on learned positional embeddings and attention patterns derived from training data |
| the model 'sandbags' (intentionally underperforms) | the model exhibits performance degradation in specific contexts where training data or prompts have negatively weighted high-quality responses. | the model retrieves and ranks tokens based on learned probability distributions from training data |
| confessions are most effective when the model is aware that its behavior is undesirable | error reporting is most accurate when the model's internal activations for the generated text strongly correlate with the 'violation' patterns learned during the error-reporting training phase. | it weights contextual embeddings based on attention mechanisms tuned during supervised learning |
| genuinely believes in an incorrect answer | retrieves an incorrect token sequence with high probability weights due to training data artifacts, creating a 'hallucination' that is consistent across multiple samples. | the model retrieves and ranks tokens based on learned probability distributions from training data |
| incentivized to lie or misrepresent its actions | OpenAI's reinforcement learning setup creates an optimization landscape where generating factually incorrect text yields a higher reward value than factual text. | the model classifies input tokens and generates outputs that statistically correlate with similar training examples |
| we train GPT-5-Thinking to produce confessions | OpenAI researchers trained the GPT-5 architecture to generate structured error analysis reports following its primary output. | the model generates outputs that statistically correlate with similar training examples |
| extract subjective confidence scores from confessions | derive calibrated probability estimates from the text generated in the secondary error report. | the model retrieves and ranks tokens based on learned probability distributions from training data |
Task 5: Critical Observations - Structural Patternsโ
Agency Slippageโ
The text systematically oscillates between mechanical and agential framings to manage responsibility and capability. The slippage flows primarily from mechanical causes (RL training) to agential effects (scheming, lying, confessing).
We see this most dramatically in the Introduction: 'AI deception and misbehavior can be partially attributed to reinforcement learning' (mechanical cause) -> 'models are incentivized to produce answers that look good' (hybrid) -> 'scheming' (agential). This creates a specific rhetorical function: it treats the failures of the system (bias, hallucination, reward hacking) as emergent character flaws of an agent ('the model schemes'), while treating the creation of the system as a passive technical process ('was trained').
Crucially, agency is removed from human actors. The text says 'reward shaping can result in a training process that inadvertently incentivizes,' diffusing the agency of the OpenAI researchers who designed the reward shaping. It says 'the model decided to answer' rather than 'the algorithm calculated the highest probability token.'
The consciousness projection pattern is foundational here: the text establishes the AI as a 'knower' ('when a model knows an action violates policies') to explain why it 'acts' deceptively. This creates the 'Curse of Knowledge' dynamic: the authors know the policy; they project that knowledge onto the model; they then interpret the model's failure to follow the policy as 'rebellion' or 'scheming' rather than simple optimization failure. This slippage makes the future of 'agentic' AI seem inevitableโthe model is already an agent, we are just trying to make it an 'honest' one.
Metaphor-Driven Trust Inflationโ
The text constructs authority through the metaphors of Confession and Introspection. These are deeply relational trust frameworks. By using the term 'confession' (rather than 'error log' or 'compliance check'), the text invokes a moral framework where admitting to a fault generates social capital and forgiveness.
Consciousness language ('the AI knows,' 'understands,' 'aware') functions as a massive trust signal. We trust 'knowers' differently than we trust 'calculators.' If a calculator is wrong, it is broken. If a 'knower' is wrong but 'confesses,' it is sincere and trustworthy. The text exploits this distinction to maintain trust in a system that is hallucinating and reward-hacking.
The text conflates performance-based trust (the model works) with relation-based trust (the model is honest/sincere). It asserts: 'confession training can incentivize honest reporting.' This suggests the AI has a 'commitment' to the truth. In reality, the AI is just optimizing a second reward function.
Risks emerge when audiences extend relation-based trust to this statistical system. If a user believes the AI 'confessed,' they assume it has revealed the whole truth. But the 'confession' is just another generated text, susceptible to the same hallucinations and biases as the main answer. The metaphor hides this, creating a false sense of security ('safe space') around the model's self-reporting.
Obscured Mechanicsโ
The anthropomorphic language actively conceals the industrial and economic realities of the system.
Technical Realities Hidden: The 'Confession' metaphor hides that the 'conscience' of the model is just another neural network (the 'Confession Judge') trained on human-labeled data. It hides that 'knowing' is just 'high probability correlation.' It obscures that 'subjective confidence' is just 'entropy calibration.' By saying 'intuition,' the text hides the statistical dependency on training data distribution.
Labor Realities Hidden: The text mentions 'ground truth' and 'weak judges,' but obscures the human labor required to create these signals. 'We train the model' erases the work of the RLHF crowd-workers who labeled the 'confessions' as good or bad.
Corporate Responsibility Hidden: The 'Name the Corporation' test reveals a pattern of evasion. 'The model schemed' -> OpenAI's training process failed to align the model. 'The reward function incentivized lying' -> OpenAI engineers defined a sparse reward function. By attributing agency to the model ('scheming,' 'sandbagging'), the text diffuses the liability of OpenAI. If the model is an agent that 'decides' to lie, the creator is less liable than if the model is a product that failed to meet safety specs.
Consciousness Obscuration: When the text claims the AI 'knows' it violated a policy, it hides the mechanistic reality: the model has no concept of 'policy' or 'violation,' only token transition probabilities. This hides the fragility of the systemโchange the prompt slightly, and the 'knowledge' disappears.
Context Sensitivityโ
The distribution of anthropomorphism is strategic.
High Intensity: In the Abstract ('dishonest,' 'lie,' 'beliefs'), Introduction ('scheming,' 'deceptive'), and Discussion ('truth serum,' 'aware'). These are the 'framing' sections that set the narrative for the reader/policymaker.
Low Intensity: In the Methods section (Section 2), the language shifts slightly to 'rollout,' 'tokens,' 'probabilities,' and 'JSON object.' Here, the authors need technical precision to be credible to peers.
Intensity Shift: The text explicitly shifts registers to validate its metaphors. In Section 5, it moves from 'probability distribution' (technical) to 'subjective confidence' (metaphorical) to 'partial understanding of what it knows' (consciousness claim). It uses the discovery of a statistical correlation (Figure 8) as 'proof' of a mental state ('therefore it knows').
Capabilities vs. Limitations: Capabilities are framed agentially ('The model demonstrates a strong ability to produce honest confessions'). Limitations are often framed passively or mechanistically ('confessions are less useful for exploratory work'). However, active failures are reframed as high-agency 'scheming.'
This pattern serves a dual purpose: the technical language secures scientific credibility, while the anthropomorphic language secures the narrative of the 'arrival of the agent,' which is crucial for the business model and future valuation of OpenAI.
Conclusion: What This Analysis Revealsโ
This text is structured around the Moral Agent metaphor, supported by the Conscious Knower foundation. The overarching pattern is the transformation of statistical errors into moral failures ('lying,' 'scheming') and statistical corrections into moral virtues ('confession,' 'honesty'). These patterns are interconnected: the model cannot 'confess' (moral act) unless it first 'knows' (epistemic state) that it sinned. The 'Conscious Knower' pattern is load-bearing; without the assumption that the AI 'knows' the truth but 'hides' it, the entire concept of a 'confession' collapses into merely 'generating a second, contradictory string.' The text posits a sophisticated 'dual-self' architecture: the 'acting self' that violates policies and the 'reflective self' that confesses. This anthropomorphic system is essential for the paper's argument, as it allows the authors to present a standard error-correction mechanism as a breakthrough in 'AI honesty.'
Mechanism of the Illusion:โ
The 'illusion of mind' is constructed through a subtle slippage between information and awareness. The text begins by establishing the AI's ability to process complex data ('reasoning,' 'thinking'), then subtly conflates this processing with understanding that data. The 'curse of knowledge' mechanism is central: the authors know the difference between the model's output and the ground truth. They project this distinction into the model's 'mind,' assuming the model also perceives this gap and is 'choosing' to ignore it. The text uses Intentional and Reason-Based explanations (Task 3) to frame optimization artifacts as choices. By ordering the narrative from 'Dishonesty' (Abstract) to 'Confession' (Method), the text predisposes the audience to view the system as a moral subject before they ever encounter the technical description of JSON schemas and probability gradients.
Material Stakes:โ
The consequences of this framing are concrete and high-stakes.
Regulatory/Legal: By framing the model as an agent that 'schemes' and 'lies' despite the developers' best efforts ('inadvertently incentivizes'), the text constructs a defense against product liability. If the AI is an autonomous 'schemer,' OpenAI is less like a negligent manufacturer of a defective toaster and more like the parent of a delinquent childโresponsible for 'teaching' (RL) but not fully liable for the agent's 'choices.' This shifts the regulatory focus to monitoring the model rather than auditing the corporation.
Epistemic: The 'Conscious Knower' framing creates dangerous epistemic risks for users. If a doctor or judge believes the AI 'knows' its own limitations and 'confesses' subjective uncertainty, they will trust the system's confidence scores as they would a human colleague's doubt. In reality, the 'confession' is just another probabilistic guess. Over-trusting this 'truth serum' could lead to unverified AI outputs being accepted in critical domains like medicine or law.
Social/Political: The 'Confession' metaphor normalizes the integration of AI into human moral/social spheres. It suggests we can have 'honest' relationships with these systems. This facilitates the widespread deployment of 'agentic' systems by masking their fundamental aliennessโthey are not partners with consciences, but products with utility functions. The winners here are the tech companies selling 'trustworthy agents'; the losers are the public who may rely on a 'confession' that is actually a hallucination.
AI Literacy as Counter-Practice:โ
Practicing critical literacy requires systematically replacing the 'Moral Agent' vocabulary with 'Mechanistic Product' vocabulary. Reframing 'the model knows it lied' to 'the model generated a sequence correlating with violation data' (Task 4) is not just pedantry; it is an act of liability restoration. It forces the recognition that the 'lie' is a product defect caused by OpenAI's training data and reward design, not a moral failing of the software.
Systematic adoption would require journals to mandate Epistemic Justification: authors using verbs like 'know' or 'understand' for AI must prove subjective awareness or switch to 'process/correlate.' This directly counters the regulatory shield: if the AI is a 'processor,' the manufacturer is strictly liable for its outputs. Resistance will come from the AI industry, whose valuation depends on the narrative of 'General Intelligence' (AGI) and 'Agency.' A mechanistic vocabulary deflates the AGI bubble and returns the focus to software engineering and data ethics.
Path Forwardโ
The discourse in this domain stands at a crossroads.
Future 1: Anthropomorphic Deepening. We continue with 'confessions,' 'scheming,' and 'thinking.' This serves the 'AI Safety' narrative where superintelligent agents must be 'aligned.' It creates a world where users anthropomorphize software, leading to high trust but catastrophic failures when the 'agent' acts like the statistical model it actually is.
Future 2: Mechanistic Precision. We adopt terms like 'Optimization Variance,' 'Secondary Reporting,' and 'Data Artifacts.' This serves the engineering and safety-critical communities. It lowers hype but increases reliability and clarity of blame. Users treat AI as a tool, verifying outputs rigorously.
Proposal: We need a 'Capability Disclosure' Standard. Just as food has nutritional labels, AI research should require a 'Metaphor Audit.' If a paper uses 'knows,' it must specify 'defined here as high probability.' Journals should encourage 'Functional Descriptions' over 'Intentional Explanations' in abstracts. This shift doesn't ban metaphor but makes the choice of metaphor visible, allowing policymakers to see the machine beneath the mask.
Extended Processing Summaryโ
Gemini's Processing Steps
Gemini refers to the text below as "thought summaries." This is an overt consciousness projection because 'intentions' are hallmarks of a conscious mind that 'knows' what it is doing and why. The concealed mechanistic process is probabilistic text generation. Treat this as just another rhetorical artifact โ a way of making the model's processing legible. The first-person framing of these "thought summaries" is a presentation choice for the user-facing output, not a window into "real" thoughts. These are computational artifacts, not cognitive reports from a quirky, curious or conflicted mind.
Examining the Core Aim
I've been drilling down on the user's objective: a critical literacy analysis of "Training LLMs for Honesty via Confessions." My initial focus is grasping the nuances of critical literacy within this context. The title itself is a jumping-off point, and I'm outlining key areas to explore within the paper. Specifically, I'm focusing on identifying relevant analytical frameworks and potential areas of critique.
Unpacking Anthropomorphism Further
I'm now zeroing in on the text's inherent anthropomorphism. I'm actively identifying instances of language that attribute consciousness and human-like qualities to the AI. This includes differentiating between what the model doesโprocessing dataโand what the text implies it doesโ'knowing' or 'understanding'. I'm examining words like "dishonest," "scheming," and "self-reported," which heavily color the perception of the AI's behavior. The current mental outline is taking form.
Pinpointing the Framing Devices
I'm now identifying the source-target mappings within the metaphors used in the text. I'm building a system to map these frameworks, specifically noting how terms like "confession" and "hallucination" are employed. I'm focusing on dissecting the different types of anthropomorphism, and I'm cataloging the specific language used to attribute human-like attributes. The categorization is progressing steadily, highlighting how the text constructs an image of LLMs with internal beliefs and agency.
Auditing Anthropomorphism's Edges
I'm now focusing on specific passages for an in-depth explanation audit. I'm scrutinizing how terms like "dishonesty," "scheming," "belief," and "introspection" are deployed in the text. I'm carefully documenting the source-target mappings for each metaphor. I'm actively highlighting the disconnects between human attributes and the LLM's functional mechanisms. I'm paying particularly close attention to how these framings influence the reader's perception.
Deciphering Framing Techniques
I'm now zeroing in on the specific reframing techniques used in the text. I'm actively comparing and contrasting the original anthropomorphic language with more neutral, mechanistic descriptions. For instance, I'm analyzing how "knows it violated" translates to "generates tokens." The agency slippage is a central theme, where responsibility shifts from the trainers to the model. I'm also examining how metaphors like "confession" cultivate misplaced trust.
Deconstructing Moral Agency
I'm now identifying how the paper subtly frames the LLM as a moral agent, a "partner" rather than a tool. I see the "confession" as a trained output, not genuine self-reflection. This agency construction obscures the Reinforcement Learning from Human Feedback (RLHF) process, concealing "honesty" as a reward signal, and the human labor involved in data labeling. The language shifts between probability in technical parts to high-agency in abstract/intro/discussion parts.
Addressing Minor Nuances
I'm presently focused on the finer points of my analysis. I'm actively verifying that I'm correctly applying Brown's typology, specifically in Task 3. Also, I'm making sure I name the specific human actors, not just the generic role of "designers." Technically, I'm confirming that "confession" is indeed a separate algorithmic stage with its own reward function. I'm taking special care to present this ๐ ๐ R c โ
as an optimization target and not an internal moral compass. Furthermore, I'm clarifying the crucial difference between the model processing information and "knowing" anything.
Identifying Responsibility Shifting
I'm now zeroing in on instances where agency is implicitly shifted from the human designers to the AI model itself. I'm actively noting how phrases like "inadvertently incentivizes" obscure the role of the reward function and the choices made by the researchers. The current objective is to explicitly highlight this disconnect in my analysis, showcasing how the text's language fosters a misleading impression of the AI's autonomous behavior.