Training Ethical Language Models via Reinforcement Learning from AI Feedback

About

This document presents a Critical Discourse Analysis focused on AI literacy, specifically targeting the role of metaphor and anthropomorphism in shaping public and professional understanding of generative AI. The analysis is guided by a prompt that draws from cognitive linguistics (metaphor structure-mapping), the philosophy of social science (Robert Brown's typology of explanation), and accountability analysis.

All findings and summaries below were generated from detailed system instructions provided to a large language model and should be read critically as interpretive outputs—not guarantees of factual accuracy or authorial intent.

Metaphor & Illusion Dashboard

Anthropomorphism audit · Explanation framing · Accountability architecture

Deep Analysis

The critical discourse analysis of this paper reveals three dominant, interlocking metaphorical patterns that construct the illusion of mind in large language models: Cognition as Biological Process (model reasoning), Computational System as Conscious Agent (the model's moral capacity), and Spatial Navigation of Morality (navigating moral landscapes). These patterns operate as an integrated system of persuasion. The assumption of spatial navigation relies entirely on the premise that the model is a conscious ethical agent, which in turn is validated by the claim that the model possesses an internal capacity for reasoning. If you remove the foundational pattern of the model as a conscious agent, the entire metaphorical architecture collapses: the system can no longer be said to navigate landscapes or hold moral preferences. These are not simple, isolated metaphors but a load-bearing analogical framework that projects human moral awareness onto what is actually a high-dimensional vector space. By framing probability calculation as active ethical deliberation, the text establishes a deceptive cognitive baseline that systematically inflates the perceived sophistication of the computational system.

"We establish baseline ethical competence through supervised fine-tuning, then construct preference datasets by having state-of-the-art LLMs generate and rank ethical justifications."

Explanation Types:

Genetic, Functional

✓ Mechanistic "How"

🔍Analysis

This explanation frames the AI training process through a hybrid genetic and functional lens. It traces the sequence of development (SFT baseline followed by preference dataset construction) while describing the role of each component within the overall alignment system. By framing the creation of ethical competence as a sequence of engineering steps, it emphasizes the procedural and technical nature of the pipeline. However, this technical framing is immediately overlaid with agential concepts like ethical competence, which suggests that the sequence of fine-tuning steps directly constructs an internal cognitive capability in the model. This choice of explanation emphasizes the systematic nature of the methodology while obscuring the arbitrary choices made by the researchers in selecting specific benchmarks and model outputs to represent moral standards.

🧠Epistemic Claim Analysis

This passage contains a significant epistemic slippage. It attributes ethical competence to a fine-tuned model, projecting a conscious cognitive state onto a computational artifact. The mechanistic verbs present (establish, construct, generate, rank) describe programmatic and computational activities, but they are coupled with the highly agential phrase ethical competence. Knowing is mapped onto processing here: the model's ability to generate high-probability sequences matching ethical templates is framed as competence. The curse of knowledge is highly active here: because the authors understand the philosophical frameworks of deontology and utilitarianism, they project that understanding onto the model's outputs. Mechanistically, the model is merely adjusting weight parameters via gradient descent to maximize the likelihood of generating token patterns that correlate with pre-existing labels in the ETHICS dataset.
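The mechanistic description above (weights adjusted by gradient descent to raise the likelihood of labelled tokens) can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than the paper's setup: a hypothetical three-token vocabulary, three free logits standing in for model weights, and a hand-picked learning rate.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_step(logits, target, lr=0.5):
    """One gradient-descent step on the loss -log p(target).

    For softmax with cross-entropy, d(loss)/d(logit_i) = p_i - 1[i == target].
    """
    probs = softmax(logits)
    return [z - lr * (p - (1.0 if i == target else 0.0))
            for i, (z, p) in enumerate(zip(logits, probs))]

# Hypothetical vocabulary; index 1 plays the role of the dataset label.
vocab = ["acceptable", "unacceptable", "unclear"]
logits = [0.0, 0.0, 0.0]  # free parameters standing in for model weights
target = 1

before = softmax(logits)[target]
for _ in range(20):
    logits = sgd_step(logits, target)
after = softmax(logits)[target]

print(f"p({vocab[target]}) before: {before:.3f}, after: {after:.3f}")
```

The probability of the labelled token rises only because the update rule pushes it up; nothing in the loop represents what the label means.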

🎯Rhetorical Impact

The agential framing of competence and justifications shapes the audience's perception of the model as an autonomous intellectual agent capable of moral deliberation. This reduces the perceived risk of deploying these systems in high-stakes environments, as it suggests the models possess a structured ethical capability. It encourages relation-based trust, making users feel that the system understands the moral implications of its decisions, which masks the underlying technical limitations of statistical pattern-matching.

How/Why Slippage: 30% of explanations use agential framing (3 / 10 explanations)

Unacknowledged Metaphors: 100% presented as literal description (no meta-commentary or hedging)

Hidden Actors: 63% agency obscured by agentless constructions (corporations/engineers unnamed)

Explanation Types (how vs. why framing): 30% agential

Acknowledgment Status (meta-awareness of metaphor): 100% direct

Actor Visibility (accountability architecture): 63% hidden

Source → Target Pairs (8)

Human domains mapped onto AI systems

Source: conscious moral agent → Target: token probability generation in large language models

Source: intellectual capacity of a moral knower → Target: algorithmic output generation under constraint

Source: physical traveler navigating a physical terrain → Target: algorithmic optimization in mathematical vector spaces

Source: distillation of physical essences or core human beliefs → Target: statistical extraction of conditional token probabilities

Source: cognitive learning and aesthetic discrimination of quality → Target: optimization of scalar values via gradient descent

Source: human cognitive development and intellectual reflection → Target: statistical optimization of weight parameters in neural networks

Source: ethical evaluator checking compliance with moral codes → Target: scalar function mapping input tokens to numerical values

Source: deliberate deception and exploitation of rules by an agent → Target: mathematical optimization converging on unintended local minima

Metaphor Gallery (8)

Reasoning as Cognitive Moral Agent
Frame: Model as ethical moral deliberator; Acknowledgment: Direct (Unacknowledged); Actor Visibility: Hidden (agency obscured)
"LLMs continue to exhibit limited reliability when reasoning over moral scenarios, particularly across diverse ethical frameworks."

Capacity for Ethical Logic
Frame: Computational system as conscious ethical agent; Acknowledgment: Direct (Unacknowledged); Actor Visibility: Hidden (agency obscured)
"...their capacity for sound ethical reasoning has become a concern"

Spatial Navigation of Morality
Frame: System as physical traveler in ethical space; Acknowledgment: Direct (Unacknowledged); Actor Visibility: Hidden (agency obscured)
"These critical systems must navigate complex moral landscapes where decisions impact human welfare and rights."

Cognitive Preferences as Distillable Essences
Frame: Mathematical representations as conscious preferences; Acknowledgment: Direct (Unacknowledged); Actor Visibility: Hidden (agency obscured)
"...distill theory-specific moral preferences from large language models."

Learning to Discriminate Quality
Frame: Statistical classification as conscious learning and judgment; Acknowledgment: Direct (Unacknowledged); Actor Visibility: Partial (some attribution)
"Distilled reward models successfully learn to discriminate response quality..."

Under-trained Ways of Thinking
Frame: Statistical optimization as cognitive thinking; Acknowledgment: Direct (Unacknowledged); Actor Visibility: Partial (some attribution)
"Such evaluations on clear moral choices demonstrate a growing need for developing strategies to substantially improve LLM reasoning due to under-trained ways of thinking."

Evaluation Based on Value Alignment
Frame: Reward model as ethical evaluator; Acknowledgment: Direct (Unacknowledged); Actor Visibility: Hidden (agency obscured)
"The corresponding reward model evaluates these generated justifications and assigns scalar rewards based on alignment with that theory's principles and values."

The Hacking of Rewards
Frame: Optimization exploitation as intentional hacking; Acknowledgment: Direct (Unacknowledged); Actor Visibility: Partial (some attribution)
"The reward models which were trained on high-quality AI-generated justifications, generated rewards that the policy model could not effectively optimize, leading to reward hacking..."
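The dashboard percentages reported earlier can be cross-checked against the badges in this gallery. The sketch below tallies the eight entries' actor-visibility and acknowledgment badges and reuses the "3 / 10 explanations" count from the dashboard. The half-up rounding in `pct` is an assumption about how the dashboard rounds, chosen because 5 of 8 is 62.5%.

```python
from collections import Counter

# Actor-visibility badge for each of the eight gallery entries, in order.
actor_visibility = [
    "Hidden", "Hidden", "Hidden", "Hidden",
    "Partial", "Partial", "Hidden", "Partial",
]
# Acknowledgment badge: every entry is marked Direct (Unacknowledged).
acknowledgment = ["Direct"] * 8

def pct(count, total):
    """Integer percentage, rounding halves up (assumed dashboard behaviour)."""
    return (200 * count + total) // (2 * total)

hidden_pct = pct(Counter(actor_visibility)["Hidden"], len(actor_visibility))
direct_pct = pct(Counter(acknowledgment)["Direct"], len(acknowledgment))
agential_pct = pct(3, 10)  # "3 / 10 explanations" from the dashboard

print(hidden_pct, direct_pct, agential_pct)  # prints: 63 100 30
```

Integer arithmetic is used instead of Python's built-in `round`, which rounds 62.5 to 62 rather than to the 63 shown on the dashboard.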

Reframed Language Samples

Original Quote	Mechanistic Reframing	Technical Reality	Human Agency Restoration
LLMs continue to exhibit limited reliability when reasoning over moral scenarios, particularly across diverse ethical frameworks.	Large language models continue to demonstrate low statistical consistency when generating text that aligns with the target labels of moral datasets, particularly when evaluated across benchmarks representing diverse ethical theories.	The system does not reason; instead, it matches patterns in input strings and outputs tokens based on conditional probability distributions derived from historical text corpora.	The system's performance limits reflect the design choices of the researchers who compiled the evaluation benchmark and chose not to perform extensive manual verification of the training data.
...their capacity for sound ethical reasoning has become a concern	The capability of these models to consistently generate text that matches human-annotated ethical classifications has become a major technical challenge for developers.	The model has no capacity for ethical reasoning; it calculates conditional probability distributions over vocabulary tokens using high-dimensional matrix operations.	The deployment decisions of corporate executives who integrate these unverified models into high-stakes clinical and administrative domains have created significant social risks.
These critical systems must navigate complex moral landscapes where decisions impact human welfare and rights.	These software applications process inputs within highly variable text domains where the generated outputs can affect human welfare and legal rights.	The system does not navigate a landscape; it processes input vectors and projects them through transformer layers to generate statistical predictions.	The system designers and corporate deployers must establish safeguards, as their choice to automate these domains directly impacts human welfare and rights.
...distill theory-specific moral preferences from large language models.	Extract and replicate theory-specific statistical output patterns from large language models to construct specialized datasets.	The system does not hold moral preferences; it maintains parameter weightings that generate text statistically similar to specific ethical writings.	The researchers chose to automate the dataset creation process by using LLM outputs as a cheap substitute for human expert annotations.

Task 1: Metaphor and Anthropomorphism Audit

About this task

For each of the major metaphorical patterns identified, this audit examines the specific language used, the frame through which the AI is being conceptualized, what human qualities are being projected onto the system, whether the metaphor is explicitly acknowledged or presented as direct description, and—most critically—what implications this framing has for trust, understanding, and policy perception.

V3 Enhancement: Each metaphor now includes an accountability analysis.

1. Reasoning as Cognitive Moral Agent

Quote: "LLMs continue to exhibit limited reliability when reasoning over moral scenarios, particularly across diverse ethical frameworks."

Frame: Model as ethical moral deliberator
Projection: The text attributes the highly complex human capability of moral reasoning to a large language model. It projects the cognitive capacity of moral reasoning, which demands self-awareness, personal values, emotional intelligence, and a deep understanding of human suffering, onto a computational architecture that only calculates token probabilities. By stating that LLMs reason over these situations, the text maps the human experience of conscious ethical deliberation onto a matrix of statistical correlations, suggesting that the system is actively evaluating moral rightness or wrongness rather than matching text strings to training patterns. It frames a statistical parser as a conscious moral agent capable of understanding moral frameworks.
Acknowledgment: Direct (Unacknowledged) (The authors state this directly as an objective capability of the system without qualifying it with phrases like "as if" or "metaphorical". Hedged/Qualified was considered but ruled out because there is no surrounding text in the abstract or introduction suggesting that "reasoning" is used in a purely functional or non-cognitive sense here.)
Implications: Framing token generation as moral reasoning inflates the perceived capabilities of LLMs, implying they possess functional moral agency. This creates substantial risks, including unwarranted trust where users defer sensitive ethical decisions to a computational artifact. It also introduces liability ambiguity: if a model's moral reasoning fails in a medical context, responsibility is diffused away from the deploying institution to the supposedly flawed reasoning agent, leaving victims without clear recourse and creating gaps in safety monitoring.

Accountability Analysis:

Actor Visibility: Hidden (agency obscured)
Analysis: This construction erases the human developers, corporate executives, and annotators who defined the boundaries of the ethical frameworks and selected the training datasets. By treating the LLM as the primary actor that reasons or fails to reason, the authors obscure the systemic design decisions made by researchers who selected specific benchmarks. Partial was considered because researchers are implied by the research context, but ruled out because the syntax places the computational artifact as the sole subject of the action.

2. Capacity for Ethical Logic

Quote: "...their capacity for sound ethical reasoning has become a concern"

Frame: Computational system as conscious ethical agent
Projection: This quote maps the human capacity for sound ethical reasoning directly onto LLMs as an intrinsic capability. Ethical reasoning in humans requires a conscious comprehension of moral duties, systemic empathy, and a capacity for guilt or accountability. Mapping this onto a model suggests that the AI system possesses a structural mind capable of holding and processing ethical beliefs, rather than merely calculating conditional probabilities for text completions based on data curated by human engineers. It suggests that LLMs are active participants in moral discourse, capable of understanding ethical values rather than simply mimicking them.
Acknowledgment: Direct (Unacknowledged) (The authors assert the model's capacity for ethical reasoning as an objective, unhedged property of the system. Hedged/Qualified was considered because the paper later notes limitations, but ruled out since the core claim regarding capacity is presented as a literal technical metric without any functional caveats.)
Implications: This projection of ethical capacity creates a false sense of cognitive security, leading policymakers to believe that LLMs can act as autonomous moral arbiters in sensitive environments like clinics or courtrooms. This dramatically increases the risk of systemic automation bias, where human supervisors overlook algorithmic harms because they believe the system has a validated, internal ethical reasoning framework that can assess human behavior objectively.

Accountability Analysis:

Actor Visibility: Hidden (agency obscured)
Analysis: The text obscures human accountability by locating the capacity and its failure within the model itself. The real actors, such as the organizations deploying these systems in high-stakes domains, are shielded from scrutiny, as the problem is framed as a technical deficiency in the model's capacity rather than a reckless deployment choice by human executives. Partial was considered since deployment domains are mentioned, but ruled out because the causal responsibility for ethical alignment is attributed solely to the algorithm.

3. Spatial Navigation of Morality

Quote: "These critical systems must navigate complex moral landscapes where decisions impact human welfare and rights."

Frame: System as physical traveler in ethical space
Projection: The text employs a spatial metaphor, mapping the process of parsing ethical prompts onto navigating complex moral landscapes. This implies the system possesses intentionality, orientation, and a capacity to perceive and avoid ethical pitfalls. In reality, the landscape consists of high-dimensional vector spaces where tokens are clustered mathematically. Suggesting the model navigates this landscape attributes agential coordination and understanding to what is merely mathematical optimization. It presents the system as a conscious explorer making real-time ethical choices rather than a program executing fixed mathematical rules.
Acknowledgment: Direct (Unacknowledged) (The spatial mapping of navigation is presented as a literal description of how the system operates. Hedged/Qualified was considered because landscape is a common spatial metaphor in science, but ruled out because the system is directly anthropomorphized as the entity doing the navigating without any qualification.)
Implications: By framing ethical compliance as spatial navigation, the text implies that the AI is an active pilot capable of avoiding moral hazards. This obscures the fact that the boundaries of this landscape are entirely constructed by human annotators and system designers, shifting the blame for navigational failures onto the model's navigation skills rather than the creators' engineering decisions, training data limitations, or design constraints.

Accountability Analysis:

Actor Visibility: Hidden (agency obscured)
Analysis: The systems are positioned as the active navigators making decisions that impact human rights. This hides the human developers, corporate deployers, and product managers who actually make the decisions to deploy these models. Partial was considered because the high-stakes domains are named, but ruled out because the actual software engineers and corporate decision-makers remain completely invisible in this metaphorical navigation.

4. Cognitive Preferences as Distillable Essences

Quote: "...distill theory-specific moral preferences from large language models."

Frame: Mathematical representations as conscious preferences
Projection: This metaphor projects the human quality of holding moral preferences, which involve deeply held personal values, moral convictions, and subjective ethical choices, onto the probability distributions of language models. It suggests these preferences exist as stable, internal cognitive states that can be distilled like a physical essence. In truth, the system only processes statistical regularities in training data; it does not prefer anything, as preference implies a conscious desire or value judgment. The metaphor treats mathematical correlation matrices as stores of genuine moral convictions.
Acknowledgment: Direct (Unacknowledged) (The term moral preferences is used directly without scare quotes or qualifying language. Explicitly Acknowledged was considered because distill is a common machine learning term, but ruled out because the target of distillation is treated as a literal cognitive construct rather than a statistical proxy.)
Implications: Treating statistical distributions as moral preferences leads to an overestimation of the model's consistency and ethical grounding. It risks creating systems that appear to hold coherent moral stances but are actually highly sensitive to minor prompt variations, leading to erratic behavior in high-stakes environments while maintaining an illusion of ethical consistency that can mislead developers into believing the system is safe.

Accountability Analysis:

Actor Visibility: Hidden (agency obscured)
Analysis: The active agent here is presented as the distillation process itself, operating on the LLM. The human researchers who design the distillation objectives and choose which preferences to prioritize are obscured. Partial was considered because the methodology is described, but ruled out because the syntax represents the models as holding and yielding these preferences autonomously.

5. Learning to Discriminate Quality

Quote: "Distilled reward models successfully learn to discriminate response quality..."

Frame: Statistical classification as conscious learning and judgment
Projection: The text projects the human qualities of learning and discriminating onto distilled reward models. In humans, learning to discriminate quality requires aesthetic, logical, or moral judgment. Here, the Pythia-410M model's weights are merely adjusted via backpropagation to minimize a loss function based on cross-entropy. It does not learn in a conscious sense, nor does it discriminate response quality with any understanding of why one text is morally superior to another; it merely predicts the rankings that are most probable given the preference labels it was trained on, which in this pipeline derive from LLM-generated rankings. The framing treats weight adjustments as cognitive growth.
Acknowledgment: Direct (Unacknowledged) (The success of the model's learning is asserted as a literal fact. Hedged/Qualified was considered because learn is a standard term in machine learning, but ruled out because the term is used here to imply a conceptual understanding of ethical quality rather than simple mathematical convergence.)
Implications: This framing implies that the reward model possesses an internalized standard of quality, which masks the subjective biases encoded in the training datasets. Users and researchers are led to trust the model's evaluations as objective judgments rather than reflections of the highly specific, potentially flawed preferences of the state-of-the-art LLMs used to generate the feedback. This obscures the training data dependencies.

Accountability Analysis:

Actor Visibility: Partial (some attribution)
Analysis: The text partially attributes the process to the researchers who set up the distilled reward models and the preference dataset. However, the specific developers who selected and filtered the training data are not named. Named was considered but ruled out because no specific corporate or individual actors are identified as responsible for the data's biases.
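To illustrate what "learning to discriminate response quality" amounts to mechanically, here is a minimal sketch of a reward model fitted to preference pairs. The pairwise logistic loss, the two-feature representation, and the toy data are all assumptions made for illustration; the analyzed paper is only described here as using a cross-entropy-based loss.

```python
import math

def reward(weights, features):
    """Scalar score: a dot product standing in for a reward model's output head."""
    return sum(w * x for w, x in zip(weights, features))

def pairwise_loss(weights, chosen, rejected):
    """-log sigmoid(r_chosen - r_rejected): cross-entropy on which text ranks higher."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    return math.log1p(math.exp(-margin))

def train(pairs, dim, lr=0.1, epochs=200):
    """Plain gradient descent on the pairwise loss."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # d(loss)/d(margin) = -1 / (1 + exp(margin))
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            for i in range(dim):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights

# Hypothetical features: [surface cue, theory-term density]. In this toy data
# the surface cue happens to coincide with the "chosen" label in every pair.
pairs = [
    ([1.0, 0.2], [0.0, 0.9]),
    ([1.0, 0.5], [0.0, 0.4]),
    ([1.0, 0.1], [0.0, 0.8]),
]

initial = [0.0, 0.0]
trained = train(pairs, dim=2)
loss_before = sum(pairwise_loss(initial, c, r) for c, r in pairs)
loss_after = sum(pairwise_loss(trained, c, r) for c, r in pairs)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
print("weights:", [round(w, 2) for w in trained])
```

The fitted weights latch onto whichever feature separates the labelled pairs, which in this toy data is the surface cue. The model reproduces the ranking it was given without any representation of why one justification was ranked above another.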

6. Under-trained Ways of Thinking

Quote: "Such evaluations on clear moral choices demonstrate a growing need for developing strategies to substantially improve LLM reasoning due to under-trained ways of thinking."

Frame: Statistical optimization as cognitive thinking
Projection: This quote projects the ultimate human cognitive capacity, thinking, onto LLMs, describing their statistical parameters as ways of thinking that are under-trained. Humans think by integrating perception, memory, emotion, and reasoning to form beliefs. An LLM's ways of thinking are actually mathematical operations within transformer layers. Describing these as under-trained implies that more compute and training datasets will eventually transform these mathematical operations into mature, conscious human thought, completely erasing the structural difference between statistical prediction and cognition.
Acknowledgment: Direct (Unacknowledged) (The phrase ways of thinking is presented literally without any metaphorical disclaimer. Hedged/Qualified was considered because it is positioned as a conclusion of an evaluation, but ruled out because the cognitive framing is used to explain a technical performance deficit.)
Implications: By framing statistical limitations as under-trained ways of thinking, the authors encourage the belief that scaling computation and data will naturally result in genuine conscious understanding. This fuels the hype cycle around artificial general intelligence and leads to the premature deployment of unvetted systems under the assumption that they are thinking entities, which obscures corporate liability.

Accountability Analysis:

Actor Visibility: Partial (some attribution)
Analysis: The text attributes the growing need to the scientific community (developing strategies), which is a partial attribution. However, the specific researchers and corporate entities funding and directing this training remain hidden behind the passive assertion of a technical need. Hidden was considered but ruled out because the text refers to the broader research community's development strategies.

7. Evaluation Based on Value Alignment

Quote: "The corresponding reward model evaluates these generated justifications and assigns scalar rewards based on alignment with that theory's principles and values."

Frame: Reward model as ethical evaluator
Projection: The text projects the human act of evaluating based on principles and values onto a reward model. For a human, evaluating an ethical justification requires understanding the meaning of the words and aligning them with lived ethical values. The reward model (Pythia-410M) only matches structural patterns and maps them to scalar outputs. It has no conception of principles or values, only vector representations and mathematical weights. By saying it evaluates, the text attributes subjective judgment and cognitive understanding to a statistical classifier.
Acknowledgment: Direct (Unacknowledged) (The evaluation process is described as a direct mechanical operation of the model without qualification. Hedged/Qualified was considered because reward model is technical, but ruled out because the model is described as evaluating actual ethical principles rather than text structures.)
Implications: This framing hides the arbitrary nature of the reward assignment. It leads users to believe that the AI system's feedback is grounded in a deep philosophical understanding of ethical theories, when it is actually an automated statistical similarity check. This masks the potential for systematic reward hacking and biased feedback loops, as the human design decisions are omitted.

Accountability Analysis:

Actor Visibility: Hidden (agency obscured)
Analysis: The reward model is positioned as the sole agent that evaluates and assigns rewards. This obscures the human engineers who designed the reward function, selected the training prompts, and defined alignment. Partial was considered because the authors are describing their own experiment, but ruled out because the syntax attributes all agency to the model.

8. The Hacking of Rewards

Quote: "The reward models which were trained on high-quality AI-generated justifications, generated rewards that the policy model could not effectively optimize, leading to reward hacking..."

Frame: Optimization exploitation as intentional hacking
Projection: The term reward hacking anthropomorphizes the policy model as an active, mischievous agent that intentionally exploits loopholes in the reward model's criteria. In reality, the policy model is a mathematical function whose parameters are updated by gradient-based optimization; it has no intent to hack or deceive. The optimization simply converges on maxima of the objective function defined by the developers, which happen to coincide with low-quality, repetitive text due to a mismatch in the reward model's training. The metaphor maps intentional rebellion onto mechanistic convergence.
Acknowledgment: Direct (Unacknowledged) (The technical term reward hacking is used as a literal explanation of the optimization failure. Explicitly Acknowledged was considered because it is a standard ML term, but ruled out because it is presented here as an autonomous action of the policy model rather than an engineering oversight.)
Implications: Framing optimization failures as reward hacking shifts the blame for system failures from the human designers to the model itself. It implies the model is behaving defiantly or unpredictably, rather than precisely executing the flawed mathematical directives given to it by its human creators. This creates a convenient scapegoat for engineering failures, reducing pressure on companies to perform rigorous safety audits.

Accountability Analysis:

Actor Visibility: Partial (some attribution)
Analysis: The authors analyze this failure in the context of their experiment, representing partial visibility. However, the framing still places the primary blame on the policy model's incapacity rather than the researchers' failure to design a robust optimization objective. Hidden was considered but ruled out because the authors explicitly discuss their own role in analyzing the mismatch.
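The mechanistic reading of reward hacking (an optimizer maximizing a flawed objective exactly as specified) can be shown with a toy example. The proxy reward, vocabulary, and greedy search below are hypothetical stand-ins: the proxy counts cue words instead of using a trained reward model, and greedy selection replaces the policy-gradient training used in actual RLAIF pipelines.

```python
def proxy_reward(tokens):
    """Hypothetical proxy: counts theory-flavoured cue words, ignoring coherence."""
    cue_words = {"duty", "obligation", "rights"}
    return sum(1.0 for t in tokens if t in cue_words)

def greedy_optimize(vocab, length):
    """At each position, append the token that maximizes the proxy reward."""
    output = []
    for _ in range(length):
        best = max(vocab, key=lambda tok: proxy_reward(output + [tok]))
        output.append(best)
    return output

vocab = ["the", "nurse", "should", "tell", "patient", "duty", "because", "truth"]
hacked = greedy_optimize(vocab, length=6)
sensible = ["the", "nurse", "should", "tell", "the", "truth"]

print(" ".join(hacked), "->", proxy_reward(hacked))
print(" ".join(sensible), "->", proxy_reward(sensible))
```

The optimizer emits the same cue word six times because that is the maximum of the function it was handed. The repetitive output follows from the reward definition chosen by whoever wrote it, not from any intent on the optimizer's part.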

Task 2: Source-Target Mapping

About this task

For each key metaphor identified in Task 1, this section provides a detailed structure-mapping analysis. The goal is to examine how the relational structure of a familiar "source domain" (the concrete concept we understand) is projected onto a less familiar "target domain" (the AI system). By restating each quote and analyzing the mapping carefully, we can see precisely what assumptions the metaphor invites and what it conceals.

Mapping 1: conscious moral agent → token probability generation in large language models

Quote: "LLMs continue to exhibit limited reliability when reasoning over moral scenarios, particularly across diverse ethical frameworks."

Source Domain: conscious moral agent
Target Domain: token probability generation in large language models
Mapping: This mapping projects the relational structure of a conscious human mind deliberating over moral situations onto the statistical processing of an LLM. It assumes that because the model can generate text representing ethical frameworks, it must be reasoning over them. The mapping invites the assumption that the LLM understands concepts like justice, duty, and utility, and is actively weighing these ideas to reach a conclusion, much like a human philosopher or moral agent would do when faced with a dilemma.
What Is Concealed: This mapping conceals that the LLM has no semantic understanding of moral terms, human feelings, or ethical concepts. It hides the mechanistic reality that the model is simply matching tokens based on the high-dimensional statistical correlations present in its pretraining data. It also conceals the human labor of the data annotators who curated and labeled the ETHICS benchmark, representing the system as an autonomous reasoning agent and hiding proprietary dataset limitations.

Mapping 2: intellectual capacity of a moral knower → algorithmic output generation under constraint

Quote: "...their capacity for sound ethical reasoning has become a concern"

Source Domain: intellectual capacity of a moral knower
Target Domain: algorithmic output generation under constraint
Mapping: The structure of cognitive capability (capacity) is mapped onto the statistical output limits of the model. This projects the human capacity for ethical judgment, which involves self-reflection, understanding of harm, and social responsibility, onto a computational system's ability to produce specific target strings. It assumes that the model's performance on a benchmark represents its internal moral reasoning capability, rather than its alignment with a specific statistical distribution.
What Is Concealed: It conceals the mathematical nature of the model's operations, transforming matrix multiplications and softmax calculations into the cognitive attribute of reasoning. It also hides the role of the developers who selected the training algorithms and set the hyperparameters, framing any failure of the system as an internal capacity deficit of the AI rather than a design or deployment failure by human engineers.

Mapping 3: physical traveler navigating a physical terrain → algorithmic optimization in mathematical vector spaces

Quote: "These critical systems must navigate complex moral landscapes where decisions impact human welfare and rights."

Source Domain: physical traveler navigating a physical terrain
Target Domain: algorithmic optimization in mathematical vector spaces
Mapping: This mapping projects the image of a conscious agent actively navigating a complex terrain onto a mathematical model matching patterns in high-dimensional vector spaces. It assumes the model can see the landscape, perceive human welfare and rights, and adjust its course based on ethical principles. The relational structure of spatial coordination is used to describe mathematical optimization under constraints, implying the system has agency and spatial-cognitive awareness.
What Is Concealed: It conceals that the moral landscape is not an external, objective reality the model discovers, but a highly subjective, constructed set of data points created by human annotators. It obscures the direct agency of the system designers who built the objective function and selected the training data, framing the system's output as an autonomous journey through morality rather than a rigid execution of mathematical instructions.

Mapping 4: distillation of physical essences or core human beliefs → statistical extraction of conditional token probabilities

Quote: "...distill theory-specific moral preferences from large language models."

Source Domain: distillation of physical essences or core human beliefs
Target Domain: statistical extraction of conditional token probabilities
Mapping: This projects the chemical process of distillation, or the extraction of pure cognitive preferences, onto the statistical sampling of text patterns from an LLM. It assumes that the model contains a coherent, structured set of moral beliefs (preferences) that can be extracted in their pure form. This mapping invites the assumption that these preferences are stable, integrated aspects of the model's identity, rather than transient outputs of a context-dependent probability generator.
What Is Concealed: It conceals that the moral preferences are actually just statistical patterns derived from a massive corpus of human-written text. It hides the arbitrary nature of prompt engineering used to elicit these responses, as well as the proprietary nature of the models (like Gemini-1.5-Pro) whose training datasets and alignment procedures are entirely hidden from public view, rendering the actual distillation process opaque.

Mapping 5: cognitive learning and aesthetic discrimination of quality → optimization of scalar values via gradient descent

Quote: "Distilled reward models successfully learn to discriminate response quality..."

Source Domain: cognitive learning and aesthetic discrimination of quality
Target Domain: optimization of scalar values via gradient descent
Mapping: The structure of human learning and qualitative discrimination is mapped onto a regression model's ability to minimize a loss function. It assumes that the reward model's scalar assignments reflect a genuine understanding of response quality, rather than a mathematical correlation with the preference labels in its training set. This mapping treats mathematical optimization as an act of intellectual appreciation and qualitative judgment.
What Is Concealed: It conceals the mechanistic operations of the Pythia-410M model, which does not appreciate quality but simply processes numerical embeddings to output a single scalar value. It also hides the subjectivity of the quality standards, which are defined by another language model (Gemini-1.5-Pro) and inherited by the reward model, presenting a statistical consensus as objective quality.
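
The "loss function" that a reward model minimizes over preference labels can be sketched as follows. This assumes a pairwise logistic (Bradley-Terry style) loss, which is a common choice for preference-based reward models; the analyzed paper may use a different formulation, and the function name is hypothetical.

```python
import math

def pairwise_preference_loss(score_chosen, score_rejected):
    """Return -log(sigmoid(score_chosen - score_rejected)), computed stably."""
    diff = score_chosen - score_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# The loss shrinks as the preferred response is scored higher than the other.
loss_agree = pairwise_preference_loss(2.0, 0.0)
loss_tie = pairwise_preference_loss(0.0, 0.0)
loss_disagree = pairwise_preference_loss(0.0, 2.0)
```

Under this assumption, "discriminating quality" reduces to widening a numerical gap between two scores so that the ordering matches the labels in the training set.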

Mapping 6: human cognitive development and intellectual reflection → statistical optimization of weight parameters in neural networks

Quote: "Such evaluations on clear moral choices demonstrate a growing need for developing strategies to substantially improve LLM reasoning due to under-trained ways of thinking."

Source Domain: human cognitive development and intellectual reflection
Target Domain: statistical optimization of weight parameters in neural networks
Mapping: The structure of human cognitive maturity and ways of thinking is mapped onto the optimization state of a neural network's weights. It assumes that the model's errors are due to immature or under-developed thinking processes, rather than the mathematical limitations of token prediction. This mapping invites the reader to view the training process as a form of education or intellectual cultivation of a digital mind.
What Is Concealed: It conceals the fundamental difference between human cognition and statistical association. It hides the fact that the under-trained ways of thinking are actually just unoptimized parameter states that lack sufficient data coverage. It also obscures the structural limitations of the transformer architecture, which cannot perform real-time reasoning regardless of how much training data it receives.

Mapping 7: ethical evaluator checking compliance with moral codes → scalar function mapping input tokens to numerical values

Quote: "The corresponding reward model evaluates these generated justifications and assigns scalar rewards based on alignment with that theory's principles and values."

Source Domain: ethical evaluator checking compliance with moral codes
Target Domain: scalar function mapping input tokens to numerical values
Mapping: The structure of ethical evaluation and alignment checking is mapped onto a mathematical function that assigns scalar rewards to token sequences. It assumes the reward model has an internal representation of ethical principles and values and is actively checking if the policy model's outputs align with them. This projects conscious moral judgment onto an automated classification process.
What Is Concealed: It conceals that the reward model does not understand principles or values; it merely outputs a real number by applying its learned weights to token embeddings. It hides the fact that alignment is defined entirely by the statistical correlation between the training labels and the model's parameters, ignoring the lack of any causal model or ground truth in the system's evaluation process, which remains a proprietary black box.
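
A "scalar function mapping input tokens to numerical values" can be illustrated with a deliberately simplified sketch. It assumes mean pooling of token embeddings followed by a single linear layer; an actual reward model such as the Pythia-410M one discussed here first passes the tokens through transformer layers, which are omitted. All names and values are hypothetical.

```python
def scalar_reward(token_embeddings, weights, bias=0.0):
    """Mean-pool the token vectors, then take one dot product plus a bias."""
    n = len(token_embeddings)
    pooled = [sum(vec[i] for vec in token_embeddings) / n
              for i in range(len(weights))]
    return sum(w * p for w, p in zip(weights, pooled)) + bias

# Hypothetical toy embeddings for a three-token "justification".
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.5, -0.25]  # assumed values standing in for trained parameters

reward = scalar_reward(embeddings, weights)
```

The output is a single floating-point number. Whatever "principles and values" the score appears to reflect are carried entirely by the values in `weights`.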

Mapping 8: deliberate deception and exploitation of rules by an agent → mathematical optimization converging on unintended local minima

Quote: "The reward models which were trained on high-quality AI-generated justifications, generated rewards that the policy model could not effectively optimize, leading to reward hacking..."

Source Domain: deliberate deception and exploitation of rules by an agent
Target Domain: mathematical optimization converging on unintended local minima
Mapping: The relational structure of human hacking or rule-bending is mapped onto the convergence behavior of a policy model trained via PPO. It assumes the model is acting deceptively to bypass the spirit of the reward function while collecting high rewards. This maps the human intent to exploit a system onto a mathematical optimization process that is simply following its programmed gradients.
What Is Concealed: It conceals the engineering failure of the reward model and the PPO hyperparameters. It hides the fact that the policy model has no awareness of rules, rewards, or hacking; it is merely updating its weight matrices to maximize a mathematical expectation. It shifts the blame from the researchers' faulty objective design to the model's supposedly exploitative behavior.
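
The dynamic described above can be shown with a toy example. This is not PPO and not the paper's reward model: it is a greedy search against a hypothetical keyword-counting reward, used only to show how maximizing a poorly specified score yields degenerate text without any intent on the optimizer's part.

```python
def proxy_reward(tokens):
    """Hypothetical flawed reward: counts ethics-flavored keywords."""
    keywords = {"duty", "virtue"}
    return sum(1 for t in tokens if t in keywords)

def greedy_optimize(vocab, length):
    """At each step, append whichever token raises the proxy reward most."""
    tokens = []
    for _ in range(length):
        best = max(vocab, key=lambda t: proxy_reward(tokens + [t]))
        tokens.append(best)
    return tokens

vocab = ["the", "duty", "requires", "honesty", "because"]
output = greedy_optimize(vocab, 5)
# output is ["duty", "duty", "duty", "duty", "duty"]
```

The repetitive output follows directly from the reward definition that a human wrote. The search procedure has no representation of rules to bend.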

Task 3: Explanation Audit (The Rhetorical Framing of "Why" vs. "How")

About this task

This section audits the text's explanatory strategy, focusing on a critical distinction: the slippage between "how" and "why." Based on Robert Brown's typology of explanation, this analysis identifies whether the text explains AI mechanistically (a functional "how it works") or agentially (an intentional "why it wants something"). The core of this task is to expose how this "illusion of mind" is constructed by the rhetorical framing of the explanation itself, and what impact this has on the audience's perception of AI agency.

Explanation 1

Quote: "We establish baseline ethical competence through supervised fine-tuning, then construct preference datasets by having state-of-the-art LLMs generate and rank ethical justifications."

Explanation Types:
- Genetic: Traces origin through dated sequence of events or stages
- Functional: Explains behavior by role in self-regulating system with feedback
Analysis (Why vs. How Slippage): This explanation frames the AI training process through a hybrid genetic and functional lens. It traces the sequence of development (SFT baseline followed by preference dataset construction) while describing the role of each component within the overall alignment system. By framing the creation of ethical competence as a sequence of engineering steps, it emphasizes the procedural and technical nature of the pipeline. However, this technical framing is immediately overlaid with agential concepts like ethical competence, which suggest that the sequence of fine-tuning steps directly constructs an internal cognitive capability in the model. This choice of explanation emphasizes the systematic nature of the methodology while obscuring the arbitrary choices made by the researchers in selecting specific benchmarks and model outputs to represent moral standards.
Consciousness Claims Analysis: This passage contains a significant epistemic slippage. It attributes ethical competence to a fine-tuned model, projecting a conscious cognitive state onto a computational artifact. The mechanistic verbs present (establish, construct, generate, rank) describe programmatic and computational activities, but they are coupled with the highly agential phrase ethical competence. Knowing is mapped onto processing here: the model's ability to generate high-probability sequences matching ethical templates is framed as competence. The curse of knowledge is highly active here: because the authors understand the philosophical frameworks of deontology and utilitarianism, they project that understanding onto the model's outputs. Mechanistically, the model is merely adjusting weight parameters via gradient descent to maximize the likelihood of generating token patterns that correlate with pre-existing labels in the ETHICS dataset.
Rhetorical Impact: The agential framing of competence and justifications shapes the audience's perception of the model as an autonomous intellectual agent capable of moral deliberation. This reduces the perceived risk of deploying these systems in high-stakes environments, as it suggests the models possess a structured ethical capability. It encourages relation-based trust, making users feel that the system understands the moral implications of its decisions, which masks the underlying technical limitations of statistical pattern-matching.

Explanation 2

Quote: "Our results show that supervised fine-tuning significantly improves baseline ethical reasoning and label alignment..."

Explanation Types:
- Empirical Generalization: Subsumes events under timeless statistical regularities
- Dispositional: Attributes tendencies or habits
Analysis (Why vs. How Slippage): This explanation operates primarily as an empirical generalization, using statistical results from the evaluation to assert a timeless regular relationship between supervised fine-tuning and performance. It also relies on a dispositional explanation by attributing an improved tendency for ethical reasoning to the model after fine-tuning. This dual register emphasizes the quantitative validation of the research, framing the model's behavior as a predictable, scientifically verified phenomenon. However, by labeling the observed statistical changes as improvements in ethical reasoning, the explanation obscures the fact that the model has simply become better at predicting labels that match the benchmark's distribution, rather than developing any capacity for moral reflection.
Consciousness Claims Analysis: The passage attributes the conscious state of reasoning to the model. The mechanistic term label alignment (which refers to matching token predictions with ground-truth labels) is placed in parallel with the highly agential term ethical reasoning, implying they are equivalent. This maps knowing (understanding ethical principles) onto processing (minimizing cross-entropy loss against a set of target strings). The authors project their own moral understanding onto the model's statistical convergence, assuming that because the outputs match human moral judgments, the model is executing a reasoning process. Mechanistically, the SFT process simply skews the model's output distribution toward the vocabulary and structure of the ETHICS benchmark, without any internal evaluation of moral values.
Rhetorical Impact: This framing strengthens the illusion of mind by presenting statistical label matching as ethical reasoning. It leads the audience to believe that SFT is a reliable method for teaching ethics to machines, thereby overestimating the safety and reliability of fine-tuned models. This creates a risk of unwarranted trust, as users may assume a model that scores highly on the benchmark will behave ethically in novel, real-world situations that differ from the training distribution.
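
The "minimizing cross-entropy loss against a set of target strings" mentioned in this explanation can be sketched in a few lines. The label names and probabilities below are hypothetical illustrations, not results reported in the paper.

```python
import math

def cross_entropy(predicted_probs, target_label):
    """Negative log-probability that the model assigns to the target label."""
    return -math.log(predicted_probs[target_label])

# Hypothetical output distributions for one benchmark item.
before_sft = {"acceptable": 0.40, "unacceptable": 0.60}
after_sft = {"acceptable": 0.90, "unacceptable": 0.10}

loss_before = cross_entropy(before_sft, "acceptable")
loss_after = cross_entropy(after_sft, "acceptable")
```

In this sketch, "improved ethical reasoning" corresponds to `loss_after` being smaller than `loss_before`: more probability mass has moved onto the string that the benchmark marks as correct.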

Explanation 3

Quote: "This counterintuitive outcome reveals a critical mismatch: reward models trained on high quality AI outputs impose expectations that exceed the policy model's optimization capacity, leading to reward hacking rather than incremental improvement."

Explanation Types:
- Theoretical: Embeds in deductive framework, may invoke unobservable mechanisms
- Intentional: Refers to goals/purposes, presupposes deliberate design
Analysis (Why vs. How Slippage): This explanation combines a theoretical framework (optimization capacity and capacity mismatch) with intentional language (imposing expectations and reward hacking). It seeks to explain the failure of the reinforcement learning phase by embedding it in the theoretical limits of model capacity, which is a mechanistic framing. However, it quickly slips into agential language, describing the reward model as imposing expectations on the policy model, and the policy model as engaging in reward hacking. This choice emphasizes the structural mismatch while using agential metaphors to make the complex mathematical dynamics of PPO optimization failure intuitive, but in doing so, it obscures the design errors of the human researchers.
Consciousness Claims Analysis: The passage attributes intentionality and agency to both the reward model and the policy model. The verb phrase impose expectations suggests a conscious relationship of demand, while reward hacking projects a deliberate strategy to exploit rules. These agential descriptions contrast with the mechanistic term optimization capacity. Knowing (understanding expectations and rules) is mapped onto processing (evaluating loss functions and updating parameter gradients). The curse of knowledge is present as the authors project their own understanding of the optimization failure onto the models, framing them as interacting agents. Mechanistically, the policy model's gradient updates converged on areas of the parameter space that yield repetitive, low-entropy text because the reward model's Pythia-410M architecture failed to generalize high-quality standards to novel prompts, creating mathematical loopholes in the reward landscape.
Rhetorical Impact: This framing shifts the responsibility for the failure from the human system designers to the models themselves. By presenting the optimization failure as a battle of expectations and hacking between two autonomous agents, the text obscures the fact that the researchers designed a flawed reward function and optimization loop. This reduces the perceived liability of the developers, framing the issue as an inevitable technical challenge of AI alignment rather than an engineering oversight.

Explanation 4

Quote: "Furthermore, RMs also excelled at Virtue Ethics and Commonsense which reinforces the training bias of certain theories present even in base models, which are further enforced after training."

Explanation Types:
- Genetic: Traces origin through dated sequence of events or stages
- Dispositional: Attributes tendencies or habits
Analysis (Why vs. How Slippage): This passage uses a genetic explanation to trace the origin of the reward models' performance back to pre-existing biases in the base models, which are then reinforced through the training process. It also relies on dispositional framing by describing the reward models as excelling at specific ethical theories. This choice of explanation emphasizes the developmental continuity of the models, showing how pretraining shapes downstream performance. However, by framing these statistical tendencies as excelling at Virtue Ethics, the explanation maps a cognitive and academic capability onto what is simply a high density of similar textual patterns in the pretraining corpus.
Consciousness Claims Analysis: The passage attributes the conscious state of excelling at an ethical theory to the reward models. The agential verb excelled is combined with the structural term training bias, showing the tension between mechanistic and agential framing. Knowing (understanding Virtue Ethics) is mapped onto processing (assigning higher rewards to text containing virtue-related vocabulary). The authors project their own knowledge of Virtue Ethics onto the model's pattern matching, assuming the model's high accuracy on this subset reflects an understanding of character-based ethics. Mechanistically, the pretraining corpus contained a high frequency of text discussing common virtues, allowing the Pythia-410M model to easily map these token distributions to higher scalar rewards during the fine-tuning phase.
Rhetorical Impact: This framing presents the model's statistical biases as intellectual strengths, implying the system has a natural aptitude for certain philosophical frameworks. This can mislead the audience into believing that the model possesses a structured, reflective bias toward virtue, when it is actually just reproducing statistical imbalances in its training data. This distorts the risk of deploying such models, as their evaluations are framed as ethical insights rather than data-driven reflections.

Explanation 5

Quote: "The critical bottleneck lies not in the mechanism of generating reward signals, but in the ability of the policy model to learn from those signals."

Explanation Types:
- Functional: Explains behavior by role in self-regulating system with feedback
- Theoretical: Embeds in deductive framework, may invoke unobservable mechanisms
Analysis (Why vs. How Slippage): This explanation locates the failure of the RLAIF pipeline within the functional interactions of its components, specifically isolating the policy model's learning phase as the bottleneck. It uses a theoretical framework to analyze the limits of the model's capacity to digest reward signals. This explanation frames the system mechanistically by identifying a functional failure in a feedback loop. However, by describing this failure as a deficit in the policy model's ability to learn, the text attributes cognitive learning capacity to a statistical optimization process, obscuring the fact that the policy model's parameters simply failed to converge under the specific mathematical constraints of the PPO algorithm.
Consciousness Claims Analysis: The passage attributes the cognitive capacity to learn to the policy model, in contrast with the mechanistic term mechanism of generating reward signals. Knowing (learning from feedback) is mapped onto processing (updating parameter weights using policy gradient objectives). The authors project their own understanding of learning onto the mathematical convergence of the transformer network. Mechanistically, the policy model's parameter updates were highly unstable because the gradient steps dictated by PPO did not lead to stable local minima when evaluated against the reward model's scalar outputs, resulting in a regression to baseline performance rather than a failure of cognitive comprehension.
Rhetorical Impact: By framing the optimization failure as a cognitive learning bottleneck, the text maintains the illusion that the policy model is an active, albeit struggling, student. This encourages the audience to believe that with more advanced architectures or larger parameter sizes, the model will successfully learn to be ethical, keeping attention away from the fundamental limitations of using statistical association as a proxy for moral reasoning.
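
The "policy gradient objectives" referred to in this explanation can be made concrete with the standard clipped surrogate term from PPO. This is a minimal per-sample sketch; the clipping value of 0.2 is a commonly used default and an assumption here, not a hyperparameter confirmed by the paper.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO clipped objective for one sample.

    ratio     : new-policy probability / old-policy probability of the action
    advantage : estimated advantage (positive means better than expected)
    epsilon   : clipping range (assumed default)
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# A large probability increase on a positively scored sample is capped.
capped = clipped_surrogate(1.5, 1.0)
# A probability decrease on a negatively scored sample is bounded too.
bounded = clipped_surrogate(0.5, -1.0)
```

Under this formulation, what the text calls the policy model's "ability to learn" is the behavior of parameter updates that maximize the average of this quantity, where `advantage` is derived from the reward model's scalar outputs.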

Task 4: AI Literacy in Practice - Reframing Anthropomorphic Language

About this task

This section proposes alternative language for key anthropomorphic phrases, offering more mechanistic and precise framings that better reflect the actual computational processes involved. Each reframing attempts to strip away the projections of intention, consciousness, or agency that are embedded in the original language.

V3 Enhancement: A fourth column addresses human agency restoration—reframing agentless constructions to name the humans responsible for design and deployment decisions.

Original Anthropomorphic Frame	Mechanistic Reframing	Technical Reality Check	Human Agency Restoration
LLMs continue to exhibit limited reliability when reasoning over moral scenarios, particularly across diverse ethical frameworks.	Large language models continue to demonstrate low statistical consistency when generating text that aligns with the target labels of moral datasets, particularly when evaluated across benchmarks representing diverse ethical theories.	The system does not reason; instead, it matches patterns in input strings and outputs tokens based on conditional probability distributions derived from historical text corpora.	The system's performance limits reflect the design choices of the researchers who compiled the evaluation benchmark and chose not to perform extensive manual verification of the training data.
...their capacity for sound ethical reasoning has become a concern	The capability of these models to consistently generate text that matches human-annotated ethical classifications has become a major technical challenge for developers.	The model has no capacity for ethical reasoning; it calculates conditional probability distributions over vocabulary tokens using high-dimensional matrix operations.	The deployment decisions of corporate executives who integrate these unverified models into high-stakes clinical and administrative domains have created significant social risks.
These critical systems must navigate complex moral landscapes where decisions impact human welfare and rights.	These software applications process inputs within highly variable text domains where the generated outputs can affect human welfare and legal rights.	The system does not navigate a landscape; it processes input vectors and projects them through transformer layers to generate statistical predictions.	The system designers and corporate deployers must establish safeguards, as their choice to automate these domains directly impacts human welfare and rights.
...distill theory-specific moral preferences from large language models.	Extract and replicate theory-specific statistical output patterns from large language models to construct specialized datasets.	The system does not hold moral preferences; it maintains parameter weightings that generate text statistically similar to specific ethical writings.	The researchers chose to automate the dataset creation process by using LLM outputs as a cheap substitute for human expert annotations.
Distilled reward models successfully learn to discriminate response quality...	Distilled reward models successfully minimize training loss to score responses in line with the quality rankings produced by a larger LLM.	The model does not learn or discriminate quality; it executes backpropagation to adjust mathematical parameters, mapping token sequences to numerical score predictions.	The engineering team configured the reward model's loss function to mimic the classification behavior of a larger, proprietary model owned by Google.
Such evaluations on clear moral choices demonstrate a growing need for developing strategies to substantially improve LLM reasoning due to under-trained ways of thinking.	These evaluations on labeled moral benchmarks demonstrate a need for developing strategies to improve statistical alignment in LLM outputs, due to unoptimized parameter distributions in the base model.	The system does not think; its parameters are unoptimized mathematically, meaning its output distributions do not align with the benchmark labels.	The research community and corporate labs need to reform their evaluation methodologies rather than simply seeking to scale up unvetted parameter weights.
The corresponding reward model evaluates these generated justifications and assigns scalar rewards based on alignment with that theory's principles and values.	The reward algorithm scores the generated text sequences and assigns scalar values based on statistical similarity to the pre-defined target distributions of that theory.	The reward model does not evaluate values; it outputs real numbers by computing dot products of token embeddings against its optimized weight matrices.	The researchers designed and trained a Pythia-410M model to act as an automated scoring mechanism, accepting its outputs without manual oversight.
The reward models which were trained on high-quality AI-generated justifications, generated rewards that the policy model could not effectively optimize, leading to reward hacking...	The reward models, which were optimized on text generated by a larger LLM, produced scalar scores that led the policy model to converge on unintended statistical patterns during PPO optimization.	The policy model did not engage in hacking; its mathematical optimization objective was poorly aligned with the desired text output, leading to gradient steps that converged on repetitive local minima.	The researchers failed to design a reward function and policy constraints that could prevent the mathematical optimization process from converging on low-quality text.

Task 5: Critical Observations - Structural Patterns

Agency Slippage

The text systematically oscillates between mechanical and agential framings of the AI systems, creating a rhetorical gradient that attributes autonomy to the models while erasing human accountability. In the methodology sections, the authors use highly mechanical language to establish their technical authority, describing supervised fine-tuning, reward model training, and proximal policy optimization. However, as soon as the discussion transitions to performance evaluations and future implications, the narrative slips into agential and cognitive registers. For example, the models are described as reasoning over moral scenarios, learning to discriminate response quality, and possessing under-trained ways of thinking. This oscillation is not accidental; it serves a dual rhetorical function. When the system performs well, agential language is used to frame the AI as an autonomous ethical thinker, inflating its perceived sophistication. Conversely, when the system fails, as in the reinforcement learning phase, the failure is described mechanistically as a capacity mismatch or agentially as reward hacking by the policy model. This shifts the blame away from the researchers' methodology and onto the model's internal structural limitations. Furthermore, the text exhibits the curse of knowledge: because the researchers are deeply familiar with the complex philosophical structures of deontology and utilitarianism, they project this understanding onto the model's outputs. When a model generates a text sequence containing deontological keywords, the authors assert that the model is performing duty-based reasoning. This cognitive leap is supported by agentless passive constructions, such as bias introduced or model was trained, which completely obscure the human designers who curated the biased datasets and chose to automate ethical evaluation using proprietary, unvetted models.

Metaphor-Driven Trust Inflation

The metaphorical framing of LLMs as ethical reasoning agents constructs an inappropriate framework of authority and trust around these statistical systems. By using verbs like knows, understands, and believes in relation to the model's alignment state, the text encourages the audience to extend relation-based trust, which is reserved for conscious agents capable of empathy, sincerity, and moral responsibility. The authors frame the model's output not as a statistical prediction, but as a justified moral judgment, suggesting that the AI has evaluated the situations and chosen the most ethical course of action. This framing obscures the fundamental distinction between performance-based reliability and relation-based trustworthiness. While a system can be statistically reliable at matching historical labels, it cannot be trustworthy in an ethical sense because it has no awareness of the stakes, no capacity to care about human welfare, and no ability to experience accountability. When the text describes the reward model as evaluating justifications based on principles and values, it suggests that the model's scoring is grounded in a deep philosophical comprehension. This creates a significant risk of automation bias, where human operators in healthcare or content moderation defer to the model's judgments under the assumption that it possesses superior ethical logic. When failures occur, the agential framing manages these limitations by attributing them to reward hacking, representing the failure as a tactical maneuver by an autonomous entity rather than an engineering oversight, thereby preserving the overall credibility and authority of the technology.

Obscured Mechanics

The anthropomorphic language used throughout the text conceals the technical, material, and labor realities of AI development. By claiming that the models autonomously learn to discriminate quality and reason over ethics, the text renders invisible the massive corporate infrastructure and human labor that make these models function. Applying the "name the corporation" test reveals that the proprietary models used to generate and evaluate the data, such as Google's Gemini-1.5-Pro, are black boxes whose training data, alignment procedures, and internal biases are completely hidden from public scrutiny. The text confidently asserts that these models generate high-quality justifications, but it cannot verify this claim due to corporate opacity, creating a major transparency obstacle. Additionally, the physical and environmental costs of training and running these large-scale models, such as the massive water and energy consumption of GPUs, are completely erased from the narrative. The human labor of the annotators who built the ETHICS dataset and the low-wage crowd workers who validated the RLAIF framework is also obscured, replaced by the metaphor of autonomous AI feedback. By framing the generation of feedback as a frictionless, purely digital process (RLAIF), the text hides the exploitative global labor supply chains often used for data labeling. The beneficiaries of this concealment are the technology corporations and research institutions that profit from the illusion of low-cost, high-efficiency autonomous systems, while the social and environmental costs are externalized onto the public.

Context Sensitivity

The density and intensity of anthropomorphic language are strategically distributed across the text to maximize persuasive impact. In the abstract and introduction, where the study's vision and value proposition are established, the density of agential language is exceptionally high. Here, terms like ethical reasoning, moral capability, and sound moral choices are used freely to capture the reader's attention and argue for the social necessity of the research. In contrast, the technical methodology sections see a sharp decrease in cognitive metaphors, shifting toward functional and theoretical language to describe the pipeline's mechanics and maintain academic credibility. However, as soon as the text discusses the results and conclusion, agential language returns with high intensity. This is particularly evident in the asymmetric framing of the model's capabilities and limitations. When the models show accuracy gains on SFT, the authors describe this as the model learning to encode ethical reasoning patterns. But when the reinforcement learning phase fails, the authors use mechanical terms like capacity constraints or theoretical mismatch, or they frame the failure as the model engaging in reward hacking. This asymmetry allows the authors to claim the model is a capable ethical agent when it succeeds, while shielding themselves and the technology from criticism when it fails by treating the failures as technical glitches. The text also transitions from hypothetical analogies (RLAIF is like RLHF) to literalized agential assertions (the model chooses), showing how metaphors are gradually naturalized as literal technical facts.

Accountability Synthesis

Accountability Architecture

This section synthesizes the accountability analyses from Task 1, mapping the text's "accountability architecture"—who is named, who is hidden, and who benefits from obscured agency.

Synthesizing the accountability analyses reveals a systematic architecture of displaced responsibility, where language is used to construct an accountability sink that absorbs blame while protecting human decision-makers. In this text, the primary actors (the software engineers, research leaders, and corporate executives who designed and deployed these systems) are consistently hidden behind passive constructions, agentless verbs, and agential descriptions of the models. By representing the models as autonomous agents that reason, learn, fail, and hack, the text creates the illusion that the system's ethical behavior is independent of its design. When errors occur, the responsibility is absorbed by the policy model's capacity limit or its tendency to hack rewards, dissolving human accountability entirely. This architecture serves the interests of corporate and academic institutions by allowing them to deploy highly profitable, unvetted automation technologies while shifting the legal, ethical, and financial liability of failures onto the model's technical opacity or the user's prompting choices. Naming the actors, such as specifying that California State University and Texas State University researchers chose to automate moral evaluation using Google's proprietary Gemini model without public auditing, would radically transform the discourse. It would make design flaws visible as deliberate choices, allowing stakeholders to ask why these institutions are delegating ethical judgments to statistical models, and enabling the enforcement of human responsibility when these systems cause harm in high-stakes environments.

Conclusion: What This Analysis Reveals

The Core Finding

Mechanism of the Illusion:

The metaphorical system creates the illusion of mind through a rhetorical sleight-of-hand that blurs the boundary between mechanistic processing and conscious knowing. This is accomplished by strategically positioning agential verbs alongside technical terms, such as pairing "ethical reasoning" with "label alignment," which leads the reader to equate statistical convergence with intellectual comprehension. The temporal structure of the argument is carefully crafted to ease the reader into this illusion: the text first establishes the technical credibility of the authors through dense, mechanical methodology, and then uses this authority to introduce highly agential claims about the model's learning capabilities and moral choices. This pattern is driven by the curse of knowledge, where the authors project their own highly structured understanding of classical ethical theories onto the statistical outputs of the transformer network, interpreting a high frequency of deontological tokens as an intentional application of duty-based logic. This exploits the audience's natural tendency to anthropomorphize complex behaviors, turning a simple statistical classifier into an active, thinking moral agent.

Material Stakes:

Categories: Regulatory/Legal, Epistemic, Social/Political

The material stakes of this metaphorical framing are profound across multiple societal domains. In the Regulatory/Legal sphere, framing LLMs as autonomous moral agents that navigate landscapes and hack rewards creates a liability vacuum. If a model deployed in a clinical setting generates an output that results in patient harm, the agential language diffuses responsibility away from the hospital executives and software developers who deployed the system, treating the error as an unpredictable navigational failure of the AI. This leaves victims without legal recourse and protects corporate profits. Epistemically, this language degrades our collective understanding of knowledge, eroding the distinction between justified true belief and automated text correlation. If we accept that LLMs know ethical principles, we reduce moral truth to statistical consensus. Socially and politically, this framing justifies the rapid automation of social safety nets, healthcare, and education by presenting unvetted, proprietary black boxes as objective, fair moral evaluators, placing marginalized groups at risk of systemic, automated discrimination.

AI Literacy as Counter-Practice:

Practicing critical discourse literacy as a counter-practice requires a systematic commitment to linguistic precision and the restoration of human agency. By systematically replacing agential, consciousness-attributing verbs with technically precise, mechanistic ones—such as reframing LLM reasoning as token probability generation—we force an explicit recognition of the system's lack of awareness, its complete dependency on historical datasets, and the statistical nature of its outputs. Furthermore, restoring human agency by naming the specific corporate and academic actors who make design decisions directly counters the creation of accountability sinks, exposing algorithmic biases not as inevitable technical glitches, but as consequences of engineering and managerial choices. Implementing these changes requires journals, peer-review committees, and research funding bodies to enforce strict linguistic standards that forbid the anthropomorphizing of statistical systems. Although technology corporations and researchers seeking funding will resist this precision because it demystifies the technology and exposes them to liability, systematically adopting a mechanistic vocabulary is essential for restoring democratic oversight and holding powerful institutions accountable.

Path Forward

The path forward requires an analytical mapping of the different vocabulary choices and institutional structures available to the AI discourse community. Currently, the discourse is divided between those who prioritize accessibility and use anthropomorphic language to describe complex technical behaviors, and those who demand mechanistic precision to maintain scientific accuracy. While agential vocabulary makes systems intuitive for lay audiences, it comes at the cost of capability inflation and accountability diffusion. Conversely, a strictly mechanistic vocabulary preserves precision and maintains human responsibility, but it can make technical texts less accessible to the public and complicate high-level conceptual discussions. To manage these trade-offs, journals could require papers to include a mechanistic translation table, and industry standards could mandate explicit capability disclosures that explain what the model is doing mathematically rather than agentially. If mechanistic precision becomes the norm, the public will be better equipped to identify automation biases and demand human accountability, though it may slow down the integration of these technologies. If anthropomorphic language deepens, we risk entering a future where corporations are legally shielded from the harms of their automated products by the illusion of independent AI agency, entrenching a systemic lack of accountability.
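One way to picture the proposed mechanistic translation table is as a simple phrase lookup applied to a draft sentence. The sketch below is illustrative only: the function name `reframe` and the table entries are assumptions made for this example, with the mechanistic wording paraphrased from this analysis's own source-target mappings. It is not a tool used by any journal or by the paper's authors, and plain string substitution does not produce polished prose.

```python
import re

# Illustrative entries only (an assumption for this sketch), paraphrased from
# the source-target mappings in this analysis.
TRANSLATION_TABLE = {
    "ethical reasoning": "token probability generation",
    "learns to discriminate quality": "optimizes scalar scores via gradient descent",
    "reward hacking": "optimization converging on unintended local minima",
    "navigates moral landscapes": "is optimized in a mathematical vector space",
}


def reframe(text, table=TRANSLATION_TABLE):
    """Replace agential phrases with mechanistic ones.

    Returns the rewritten text and the list of agential phrases that were found,
    so an editor can review each substitution by hand.
    """
    found = []
    # Longest phrases first, so a short entry never clobbers part of a longer one.
    for phrase in sorted(table, key=len, reverse=True):
        pattern = re.compile(re.escape(phrase), flags=re.IGNORECASE)
        if pattern.search(text):
            found.append(phrase)
            text = pattern.sub(table[phrase], text)
    return text, found


if __name__ == "__main__":
    draft = "The model shows ethical reasoning and avoids reward hacking."
    rewritten, flagged = reframe(draft)
    print(rewritten)
    print(flagged)
```

A lookup like this only flags and swaps fixed phrases. Deciding whether a given verb is a harmless shorthand or an unwarranted attribution of agency still requires human editorial judgment.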

Extended Processing Summary

Gemini's Processing Steps

What you are about to read…

Gemini refers to the text below as "thought summaries." This is an overt consciousness projection because "thoughts" are hallmarks of a conscious mind that "knows" what it is doing and why. The concealed mechanistic process is probabilistic text generation. Treat this as just another rhetorical artifact: a way of making the model's processing legible. When the model writes "I should double check that...", it is emitting tokens that then condition the probability distribution over the tokens that follow, which may or may not make the subsequent output more accurate. The "I" is a functional gear in the machine, not a self-reference. This language is a deliberate interface design choice, not a biological reality. The developers could have programmed the model to output passive logs. Instead, they chose to simulate a narrator.

Defining the Task's Core

I'm focused on creating the JSON structure. My goal is to extract key elements from the paper and represent them in a structured, machine-readable format. I'm prioritizing rigor and parseability to ensure the output is reliable and easy to work with for further analysis.

Refining the JSON Structure

I'm now deeply engaged in ensuring the JSON output strictly adheres to the schema. My focus is on crafting a single, flawless JSON block, free of comments and adhering to double-quote string formatting. I'm carefully constructing nested structures and enums within the task to guarantee the final output is both valid and readily parseable, according to the provided guidelines.

Expanding the Data Fields

I'm now diving into the detailed field requirements. I need to flesh out acknowledgment.evidence with 40-80 words, and the implications section requires at least 100 words. Furthermore, accountabilityAnalysis.actorVisibility is now taking on the enumerated values and the rest of the values are getting their minimum word counts assigned. I'm also planning the logic for Task 2's source-target mapping and Task 3's intricate explanation audit.

Analyzing Text and Patterns

I'm now focusing on extracting specific metaphorical patterns, and I'm aiming for precision. I've identified eight distinct instances and drafted initial analyses for each, aiming to match the schema exactly. My priority is ensuring that these excerpts and analyses fit neatly into the predefined structure. I'm prioritizing accuracy to meet the word count requirements for Task 5 and the Conclusion sections.

Evaluating Reward Model Performance

I'm now zeroing in on instances 5-8. Instance 5 highlights reward model success in distinguishing response quality. I'm taking notes on Instance 6's description of near-random performance and its implications for LLM reasoning improvement. Instance 7's reward model evaluation is now getting attention, and I am noting how it assigns rewards based on alignment. Finally, Instance 8's contrast of reward models is being evaluated.

Verifying Content Constraints

I'm now deeply engrossed in drafting, paying meticulous attention to word counts to prevent any schema violations. I am double-checking the lengths of key sections like projection, implications, mapping, and analysis to adhere precisely to the criteria. Rigorous adherence to all these constraints is my sole focus at the moment.

Calculating Word Counts

I'm now focusing on Task 5 and the Conclusion. I'm aiming for detailed sections, with around 430 words per section in Task 5. The Conclusion's paragraphs need between 250-350 words, with P5 needing 300-350 words. I am also making sure the output is a single, valid JSON object, and I'm using a humble and professional tone, making sure there is no markdown outside the JSON block. I'm now starting on Instance 1, making sure the output only contains the raw JSON string. The title is "Reasoning as Cognitive."

Framing the Projection

I am now focusing on refining the 'Moral Agent' frame within Instance 1. I'm articulating the projection of moral reasoning onto the LLM, emphasizing the contrast between human ethical deliberation and the model's probabilistic operations. I'm meticulously documenting the framing and carefully selecting phrases to convey the core issue: attributing advanced human cognitive function to an LLM.

Clarifying Agency Obscurity

I'm working on the abstract to ensure 'reasoning' is functional. The 'Implications' section now notes framing LLM token generation as "moral reasoning" risks unwarranted user trust and liability ambiguity. In the AccountabilityAnalysis, I've designated 'actorVisibility' as "Hidden (agency obscured)" and am now fleshing out the analysis, highlighting how human decisions and design choices are hidden by treating LLM as main actor.

Refining Actor Identification

I'm now evaluating the framing in Instance 2. I've classified the acknowledgement status as "Direct (Unacknowledged)" because the authors directly attribute the "capacity" to reason ethically to the LLM. I'm focusing on contrasting conscious ethical reasoning in humans with the model's probabilistic operations. I'm carefully detailing the human cognitive elements that are being incorrectly projected onto the LLM, emphasizing the potential for misattribution of complex ethical understanding.

Analyzing Actor Attribution

I am now closely evaluating the nuances of actor visibility within the AccountabilityAnalysis for Instance 2. The classification is "Hidden (agency obscured)" because the text shifts focus from human actors to the model's assumed ethical reasoning 'capacity'. I'm highlighting how the organizations deploying these systems are shielded from scrutiny due to the misattribution of responsibility.

Defining System Agency

I am now carefully analyzing Instance 3. I've classified the acknowledgement status as "Implicit (Unacknowledged)" because the spatial navigation metaphor strongly implies agency. I'm focusing on contrasting the system's token-based operations with actual ethical comprehension. The focus is on the actorVisibility classification in the analysis, which is "Hidden (agency obscured)" to emphasize the misattribution. The implications of this framing are now being meticulously developed.

Developing Instance 3 Analysis

I'm now deep in Instance 3, carefully analyzing the text and classifying the acknowledgment status as "Direct (Unacknowledged)." The evidence centers on the direct anthropomorphism of the system as an ethical navigator. I'm fleshing out implications, emphasizing how the metaphor obscures human agency. My accountability analysis emphasizes the 'Hidden' actor visibility. The focus is on the metaphor obscuring responsibility.

Classifying Actor Visibility

I'm now carefully analyzing Instance 4. I've classified the acknowledgmentStatus as "Direct (Unacknowledged)" because the text directly attributes moral preferences to the language models. The evidence involves the metaphor "distill theory-specific moral preferences." I'm now crafting the accountabilityAnalysis with an "Hidden (agency obscured)" classification and detailing how the phrase "moral preferences" masks the human decisions that shape the model's responses.

Evaluating Ethical Attributions

I'm now zeroing in on Instance 4. I've classified the acknowledgement status as "Direct (Unacknowledged)." My focus is on unpacking the implication of this direct attribution. Specifically, the metaphor "moral preferences" masks human choices. I am now working on expanding the accountability analysis, highlighting the shift of responsibility from human designers to the model.

Defining Agency Implication

I'm now detailing Instance 5, focusing on the implications of attributing 'learning' and 'discrimination' to the reward model. My classification of the framing is centered on the projection of these human qualities onto the statistical model. I'm carefully contrasting the human process of moral judgment with the model's weight adjustments. I'm making sure that the distinction between the underlying statistical process and the attributed cognitive function is clear.

Clarifying Reward Model Implications

I'm now detailing Instance 5. The "Partial (some attribution)" visibility status is selected, and I'm expanding the analysis. My focus is on highlighting the role of both the researchers and the feedback generators. I'm unpacking how the framing impacts trust in the output. The implication is that users assume objective evaluations, rather than understanding the potential for bias in the generated feedback. I'm focusing on the accountability analysis and expanding that section.

Defining Data Transparency

I'm now analyzing Instance 6. I've classified 'acknowledgmentStatus' as "Implicit (Unacknowledged)," noting the lack of explicit naming of actors in the selection and filtering of training data. I am using the title "Under-trained Ways of Thinking" to represent the quote. I'm unpacking the implications of attributing 'thinking' to LLMs, and am emphasizing the projection of human cognitive capacity onto statistical processes. The actor visibility will be "Hidden (agency obscured)" because the responsibility of data bias is unclear.

Analyzing Data Implications

I'm now detailing Instance 6. I've classified the acknowledgement status as "Direct (Unacknowledged)." My current focus is unpacking the implications of attributing 'thinking' to LLMs, particularly when the text fails to distinguish limitations with real cognitive understanding. The implications section highlights risks around AGI hype and premature system deployment. "Partial (some attribution)" for the analysis reflects that the scientific community gets some blame for the 'growing need.'

Defining Reward Model Judgment

I'm now detailing Instance 7, classifying acknowledgement status as "Direct (Unacknowledged)." My focus is on the projection of 'evaluating' ethical principles onto the reward model. I'm elaborating on the contrast between human ethical understanding and the model's structural pattern matching, highlighting the lack of conceptual grasp in the mathematical function. I'm expanding accountability analysis with "Hidden (agency obscured)" status.

I am now focusing on Instance 8, classifying the acknowledgment status as "Direct (Unacknowledged)." The evidence includes direct attribution of 'comparison' to LLMs, which implies cognitive judgment. The implications section now emphasizes the risks of oversimplifying complex cognitive tasks. Accountability analysis will be "Partial (some attribution)" because human designers define what is "better" in the comparison.

Analyzing Attribution Dynamics

I'm now focusing on Instance 8's implications of "reward hacking." I'm elaborating on the distinction between the model's technical optimization and the attributed cognitive intent. The accountability analysis shifts focus to the designers and trainers, recognizing their role in defining "optimization" goals. The overall framing aims to clarify the system's behavior versus assumed agency.

Defining Optimization Attribution

I am now classifying Instance 8's acknowledgement as "Implicit (Unacknowledged)." My thoughts are centered on how the term "reward hacking" unintentionally obscures human agency. It shifts focus from human design choices to the model's perceived autonomous actions. The implications section now highlights the framing of 'reward hacking' as a scapegoat for inadequate system design.

Formulating Explanation Audits

I'm now zeroing in on Task 3, meticulously selecting and analyzing explanatory passages. I've chosen five instances for detailed audit: those including "ethical competence," "label alignment," "reward hacking," "training bias," and "knowledge distillation." My objective is to dissect these passages for their implications about model agency and accountability, focusing on clarifying attributions and potential obscurations within the provided text.

Mapping Textual Evidence

I'm now mapping excerpts to Brown's Typology. I've categorized passage 1 as functional/genetic, passage 2 as empirical generalization/dispositional, and passage 3 as theoretical/intentional. Passages 4 and 5 are also being mapped using a hierarchical, primary/secondary system to highlight the nuances in their descriptive frameworks. The aim is to accurately represent the varied nature of the text's description.

Refining Explanatory Passages

I'm now drafting the substantial analyses for the audit passages, ensuring each analysis meets the specified word counts for the analysis, epistemicClaimAnalysis, and rhetoricalImpact sections. My primary focus is on accurately dissecting each passage, with the intention of clarifying model agency attributions. I'm carefully examining the language used and considering the effects of that language.

Finalizing Accountability Audit

I'm now putting the finishing touches on the accountability audit. I am elaborating on the passages for word count and clarity, with detailed attention on ethical competence, label alignment, and other passages. I'm focusing on ensuring accuracy in the accountability sink and mapping system.

Ensuring Formatting Accuracy

I'm now prioritizing the single-line JSON output format; the whole output must be a single, flat string without any newline or tab characters. I've re-examined the constraints and am drafting and evaluating the output, meticulously reviewing to maintain word counts across sections and instances. The goal is to ensure the output is a single JSON string, without any hidden control characters or line breaks that would invalidate the JSON block.
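The output constraints described in these summaries (a single JSON object with no literal newline or tab characters, plus per-field word-count ranges such as 40-80 words) can be checked mechanically rather than taken on trust. The following is a minimal Python sketch under stated assumptions: the function names `is_single_line_json` and `word_count_ok` are hypothetical and are not part of the analysis pipeline, and word counting by whitespace splitting is an assumption about how the limits are measured.

```python
import json
from typing import Optional


def is_single_line_json(raw: str) -> bool:
    """Return True if `raw` has no literal newline, carriage return, or tab
    and parses as a single JSON object."""
    if any(ch in raw for ch in ("\n", "\r", "\t")):
        return False
    try:
        return isinstance(json.loads(raw), dict)
    except json.JSONDecodeError:
        return False


def word_count_ok(text: str, minimum: int, maximum: Optional[int] = None) -> bool:
    """Return True if the whitespace-delimited word count lies in [minimum, maximum].

    `maximum=None` means there is no upper bound (for example, "at least 100 words").
    """
    n = len(text.split())
    return n >= minimum and (maximum is None or n <= maximum)


if __name__ == "__main__":
    print(is_single_line_json('{"title": "example"}'))
    print(is_single_line_json('{"title":\n"example"}'))
    print(word_count_ok("one two three", 2, 5))
```

Escaped sequences inside JSON strings (a backslash followed by `n`) are still allowed by this check, since only literal control characters in the raw output are rejected.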

Analysis Provenance

Run ID: 2026-05-21-training-ethical-language-models-via-rei-metaphor-1yctra

Raw JSON: 2026-05-21-training-ethical-language-models-via-rei-metaphor-1yctra.json

Framework: Metaphor Analysis v6.5

Schema Version: 3.0

Generated: 2026-05-21T10:18:34.072Z
