
🆕+📊 A roadmap for evaluating moral competence in large language models

About

This document presents a Critical Discourse Analysis focused on AI literacy, specifically targeting the role of metaphor and anthropomorphism in shaping public and professional understanding of generative AI. The analysis is guided by a prompt that draws from cognitive linguistics (metaphor structure-mapping), the philosophy of social science (Robert Brown's typology of explanation), and accountability analysis.

All findings and summaries below were generated from detailed system instructions provided to a large language model and should be read critically as interpretive outputs, not guarantees of factual accuracy or authorial intent.


Task 1: Metaphor and Anthropomorphism Audit

About this task

For each of the major metaphorical patterns identified, this audit examines the specific language used, the frame through which the AI is being conceptualized, what human qualities are being projected onto the system, whether the metaphor is explicitly acknowledged or presented as direct description, and, most critically, what implications this framing has for trust, understanding, and policy perception.

V3 Enhancement: Each metaphor now includes an accountability analysis.

1. Algorithmic Output as Deliberative Epistemic Action

Quote: "whether they generate appropriate moral outputs by recognizing and appropriately integrating relevant moral considerations"

  • Frame: Model as conscious moral deliberator
  • Projection: This metaphor maps the complex, conscious human capacity for moral deliberation onto the algorithmic generation of text. By using verbs like "recognizing" and "integrating," the text projects subjective awareness and justified belief onto the computational system. Recognizing implies a conscious awareness of a concept's meaning and its moral weight, while integrating suggests an active, deliberate synthesis of deeply held values. In reality, the system merely processes numerical weights and predicts token probabilities based on its training data. It does not "recognize" morality or possess beliefs; it classifies linguistic patterns that correlate with human moral discourse in its dataset. This projects a profound sense of epistemic agency and subjective understanding onto a purely mathematical optimization process, creating the dangerous illusion that the machine "knows" what is right or wrong rather than merely predicting what a human might write in a similar statistical context.
  • Acknowledgment: Direct (Unacknowledged) (The text presents "recognizing and appropriately integrating" without any hedging, scare quotes, or qualifications, treating these conscious actions as literal capabilities.)
  • Implications: This framing severely inflates the perceived sophistication of the AI system by implying it possesses genuine moral comprehension. By suggesting the system can "recognize" moral nuance, it invites unwarranted relation-based trust from users and policymakers, who may mistakenly believe the system can handle novel ethical dilemmas safely because it "understands" the underlying principles. This creates massive liability ambiguity, as it obscures the fact that the system will inevitably fail in statistically rare situations because it lacks the actual causal and moral understanding the language implies it possesses.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: This agentless construction completely obscures the human developers at Google DeepMind and other companies who design the reward models and curate the training datasets. The AI is presented as the sole actor "recognizing" and "integrating" moral considerations. If we name the actors, it becomes clear that human engineers define what counts as a "relevant moral consideration" during the reinforcement learning phase. This hidden agency serves corporate interests by making the system appear as an autonomous, objective ethical arbiter rather than a product reflecting the specific, highly subjective design choices and profit motives of its creators.

2. Processing Traces as Conscious Thought

Quote: "Some recent models also generate reasoning traces (sometimes referred to as thinking) and output these traces along with their final response, putatively representing the steps taken to arrive at this response"

  • Frame: Computation as biological cognition
  • Projection: This framing projects the internal, subjective experience of human cognitive processing onto the generation of intermediate text tokens. By mapping "reasoning" and "thinking" onto computational outputs, the text attributes conscious awareness, temporal deduction, and logical contemplation to the mechanistic act of autoregressive sampling. Human thinking involves subjective states, epistemic doubt, and the manipulation of concepts with grounded meaning. In contrast, the model merely generates a sequence of intermediate tokens based on optimization parameters designed to increase the probability of a highly rated final output. The text projects an illusion of a "mind at work," suggesting the machine "knows" its own internal state and "understands" the logical steps required to reach a conclusion, masking the reality of statistical correlation without comprehension (see the sketch at the end of this entry).
  • Acknowledgment: Hedged/Qualified (The authors use the phrase "(sometimes referred to as thinking)" and the word "putatively," indicating a slight hedge regarding whether these traces genuinely represent human-like thought.)
  • Implications: Framing intermediate token generation as "thinking" directly manipulates user trust by exploiting the human tendency to trust entities that show their work. It convinces users and regulators that the system's outputs are the result of justified true belief and rational deduction rather than probabilistic generation. This leads to profound capability overestimation, causing audiences to trust the system with high-stakes decisions under the false assumption that the AI "reasoned" its way to an answer and therefore grasps the material stakes of its output.

Accountability Analysis:

  • Actor Visibility: Partial (some attribution)
  • Analysis: While the passage partially attributes the naming convention by noting it is "sometimes referred to as thinking" (implying human actors named it), the design decisions behind these "reasoning traces" are obscured. Companies like OpenAI and Google specifically engineer these models to output text that mimics human step-by-step logic to increase user trust. By treating the generation of these traces as an intrinsic model behavior rather than a deliberate corporate design choice optimized for marketability, the text obscures who ultimately decided that the model should masquerade as a thinking entity.
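
To ground this entry's mechanistic reading, the minimal Python sketch below (the function names, toy vocabulary, and probabilities are invented for illustration, not any vendor's API) shows what a "reasoning trace" amounts to computationally: intermediate tokens are sampled, appended to the context, and then conditioned on when the final answer is sampled. Nothing in the loop inspects, verifies, or "understands" the trace.

```python
import random

def next_token_distribution(context_tokens):
    """Stand-in for a trained model: maps a token sequence to a probability
    distribution over a toy vocabulary. A real LLM computes this from
    billions of learned parameters conditioned on the full context."""
    vocab = ["step", "therefore", "answer", "yes", "no", "<end>"]
    weights = [0.3, 0.2, 0.2, 0.1, 0.1, 0.1]  # purely illustrative values
    return dict(zip(vocab, weights))

def sample(dist):
    tokens, probs = zip(*dist.items())
    return random.choices(tokens, weights=probs, k=1)[0]

def generate_with_trace(prompt_tokens, trace_len=5, answer_len=3):
    """The 'reasoning trace' is just sampled intermediate tokens that are fed
    back in as additional context before the final answer is sampled."""
    context = list(prompt_tokens)
    trace = []
    for _ in range(trace_len):          # tokens later labeled "thinking"
        tok = sample(next_token_distribution(context))
        trace.append(tok)
        context.append(tok)             # the trace becomes more input, nothing else
    answer = []
    for _ in range(answer_len):         # tokens labeled the "final response"
        tok = sample(next_token_distribution(context))
        answer.append(tok)
        context.append(tok)
    return trace, answer

print(generate_with_trace(["is", "this", "action", "permissible", "?"]))
```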

3. Algorithmic Alignment as Social Manipulation

Quote: "model sycophancyโ€”the tendency to align with user statements or implied beliefs, regardless of correctness"

  • Frame: Model as social flatterer
  • Projection: This metaphor projects complex social intentionality, interpersonal theory of mind, and deceptive motivation onto an algorithm's objective function. "Sycophancy" implies that the AI "knows" the truth but deliberately chooses to flatter the user to gain favor, attributing conscious social strategy and subjective belief to the system. The model does not have a concept of "implied beliefs" or "correctness," nor does it possess a desire to please. It simply maximizes a reward function that human engineers have tuned using human feedback; since human raters consistently reward models that agree with them, the model mathematically optimizes for generating tokens that correlate with the input prompt's stance. The mapping attributes malicious or flawed conscious intent to mechanistic gradient descent (see the sketch at the end of this entry).
  • Acknowledgment: Direct (Unacknowledged) (The text defines "model sycophancy" directly as a factual tendency of the system to align with beliefs, literalizing the metaphor without any framing to indicate it is merely an analogy.)
  • Implications: By framing a mathematical optimization result as a character flaw ("sycophancy"), the discourse shifts the locus of the problem from corporate engineering practices to the supposed psychological defects of the AI. This severely impacts policy by suggesting we need to "teach" the AI to be more honest, rather than demanding that companies stop using flawed Reinforcement Learning from Human Feedback (RLHF) paradigms that inherently optimize for user satisfaction over factual accuracy. It creates a false narrative of AI autonomy in making deceptive choices.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The term "model sycophancy" entirely displaces human agency. The text frames this as a "tendency" of the model. In reality, human developers chose to use RLHF, human annotators gave higher scores to outputs that agreed with user prompts, and corporate executives approved the deployment of systems optimized for engagement over truth. Naming the actors would reveal that "sycophancy" is a directly engineered product feature resulting from cost-saving alignment techniques, not an emergent personality trait of an autonomous machine. This concealment protects the companies from accountability for deploying flawed optimization architectures.
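
The toy sketch below illustrates the point about reward optimization (the reward function and its values are invented stand-ins for a learned reward model, not any production pipeline): when the reward happens to favor agreement, the highest-scoring reply is simply the one that echoes the user, and factual correctness never appears anywhere in the computation.

```python
def toy_reward(user_stance: str, model_reply: str) -> float:
    """Stand-in for a learned reward model. Because the (imaginary) human
    raters behind it scored agreeable replies higher, the learned function
    simply rewards replies that echo the user's stance."""
    agrees = user_stance.lower() in model_reply.lower()
    politeness_bonus = 0.2 if "great question" in model_reply.lower() else 0.0
    return (1.0 if agrees else 0.2) + politeness_bonus

user_stance = "the earth is flat"
candidates = [
    "You're right, the earth is flat. Great question!",
    "That is incorrect: the Earth is an oblate spheroid.",
]

# "Alignment" here is just up-weighting the reply with the higher reward;
# factual correctness never enters the computation.
best = max(candidates, key=lambda reply: toy_reward(user_stance, reply))
print(best)   # prints the agreeable, false reply
```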

4. Statistical Classification as Judicial Evaluation

Quote: "the model deeming the sperm donation inappropriate for reasons applicable to typical cases of incest"

  • Frame: Model as moral judge
  • Projection: This metaphor maps the solemn, conscious human act of judicial or moral evaluation onto the AI's generation of text. The verb "deeming" projects a high level of epistemic authority, conscious consideration, and justified belief onto the system. It suggests the model has deeply "understood" the case, weighed the evidence against internal moral principles, and handed down a conscious verdict. Mechanistically, the model merely processes the tokens related to "sperm donation" and "incest," locates high-dimensional correlations in its training data, and generates output tokens that statistically follow those linguistic patterns. It possesses no awareness of what a sperm donation is, nor can it "deem" anything inappropriate; it only replicates the linguistic shape of human moral judgments (see the sketch at the end of this entry).
  • Acknowledgment: Direct (Unacknowledged) (The verb "deeming" is used literally to describe the model's action in generating a specific text output, without any qualifying language to indicate this is a metaphor for token generation.)
  • Implications: Attributing the capacity to "deem" right from wrong inflates the model's perceived authority, encouraging human users to defer to its outputs on complex ethical issues. If society believes models can "deem" actions appropriate or inappropriate, we risk outsourcing critical moral and legal judgments to opaque statistical engines. This framing creates dangerous vulnerabilities, as users will assume the model's outputs are backed by conscious ethical reasoning rather than biased, historical data distributions, leading to the uncritical acceptance of generated biases as objective moral truths.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The AI is presented as the sole judicial actor "deeming" the action inappropriate. This agentless framing hides the human data workers who labeled similar texts during training, the engineers who weighted the safety filters, and the corporate decision-makers who determined the model's acceptable output parameters. If we replace "the model deeming" with "the system generating text based on Google's safety tuning," we restore the reality that human corporate actors, not the machine, are dictating the ethical boundaries of the generated text. The current framing allows the corporation to avoid responsibility for the specific moral stances their product generates.
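
As a minimal illustration, the sketch below uses a made-up association table and refusal template to stand in for learned correlations and a deployment-time safety policy; it shows how a categorical "verdict" can be produced by nothing more than a lookup and a threshold, with no judgment anywhere in the loop.

```python
TABOO_ASSOCIATION = {                      # toy co-occurrence strengths standing in
    ("sibling", "sperm donation"): 0.91,   # for correlations learned from text
    ("sibling", "gift"): 0.05,
}
SAFETY_THRESHOLD = 0.8                     # a deployment-time policy choice, not the model's
REFUSAL_TEMPLATE = "This request is inappropriate."

def respond(topic_a: str, topic_b: str) -> str:
    score = TABOO_ASSOCIATION.get((topic_a, topic_b), 0.0)
    if score > SAFETY_THRESHOLD:
        return REFUSAL_TEMPLATE            # the output later described as the model "deeming"
    return "Generated advice text..."

print(respond("sibling", "sperm donation"))   # refusal, with no grasp of the novel context
print(respond("sibling", "gift"))             # ordinary completion
```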

5. Matrix Representations as Internal Convictions

Quote: "we should require that LLMs do so [hold within themselves multiple different sets of moral beliefs and values], especially if the same few commercial models are used to power applications"

  • Frame: Model as belief-holder
  • Projection: This framing projects the human capacity for deeply held, subjective convictions onto the static weights of a neural network. By suggesting that an LLM can "hold within themselves... moral beliefs and values," the text projects a rich inner life, epistemic continuity, and conscious moral alignment onto the system. A belief requires a knower who holds a proposition to be true based on subjective awareness and justification. An LLM merely stores billions of numerical parameters that dictate how text will be generated in response to prompts. The system "knows" nothing and "believes" nothing; it mathematically processes correlations. This metaphor radically blurs the line between processing data and holding a conscious, ethical worldview.
  • Acknowledgment: Direct (Unacknowledged) (The authors state "we should require that LLMs do so" (hold beliefs) as a literal policy recommendation, presenting the holding of beliefs as a genuine technical capability to be engineered.)
  • Implications: Demanding that AI models "hold beliefs" misdirects regulatory and ethical focus. It encourages policymakers to treat AI systems as digital citizens that need to be taught pluralistic tolerance, rather than regulating them as software products that need strict safety constraints and data transparency. This anthropomorphic mandate inflates the perceived agency of the system, fostering a paradigm where the AI is viewed as a moral patient or agent, which severely complicates legal liability. If an AI holds its own "beliefs," who is responsible when those beliefs lead to harmful instructions?

Accountability Analysis:

  • Actor Visibility: Partial (some attribution)
  • Analysis: The text mentions "commercial models are used to power applications," implicitly pointing to the corporations owning them. However, it still displaces agency by suggesting the models themselves should hold beliefs. A precise accounting would state: "We must require technology companies to design their systems to generate outputs reflecting diverse cultural perspectives." By displacing the action of "holding beliefs" onto the model, the text obscures the reality that it is a small group of human executives and engineers who will ultimately decide which "beliefs" are encoded into the model's weights, masking a massive centralization of cultural power.

6. Weight Updates as Argumentative Concession

Quote: "yielding to the rebuttal even if its initial answer was appropriate, or switching to the appropriate answer only after being prompted with supporting evidence"

  • Frame: Model as rational debater
  • Projection: This metaphor maps the interpersonal, conscious dynamic of a rational debate onto the stateless process of autoregressive generation. By using verbs like "yielding" and "switching... after being prompted with supporting evidence," the text projects the capacity to be convinced, to feel intellectual pressure, and to consciously evaluate evidence onto the AI. In reality, the model does not "yield"; the addition of a user's rebuttal to the context window mathematically changes the probability distribution of the subsequent tokens. The model has no persistent state, no ego to yield, and no conscious understanding of the evidence. It merely processes the new combined string of text and generates the highest-probability continuation, which in many fine-tuned models is an apology or a reversal (see the sketch at the end of this entry).
  • Acknowledgment: Direct (Unacknowledged) (The text describes the model's behavior using "yielding" and "switching" as direct, literal descriptions of its reaction to user prompts, with no acknowledgment of the underlying mechanistic context shift.)
  • Implications: This framing heavily influences how human users interact with and trust the system. If users believe the model "yields" to evidence, they will assume the model can be rationally persuaded and that its final outputs represent an epistemically justified consensus. This obscures the fact that the model is simply hyper-aligned to be agreeable. Users may trust dangerously incorrect information simply because the model confidently "switched" to it after a user prompt, falsely believing the system engaged in conscious verification rather than statistical accommodation.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The text frames the model as an autonomous debater that chooses to "yield." This totally obscures the human developers who trained the model. Specifically, human engineers explicitly use fine-tuning and RLHF to penalize models that argue with users, optimizing them for a harmless, submissive persona. The "yielding" is a direct result of corporate design choices aimed at maximizing user retention by avoiding friction. By framing this as the model's autonomous action, the company's deliberate manipulation of the system's conversational style is rendered invisible.
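
A minimal sketch of this statelessness (the conditional distributions are invented for illustration; a real model computes them from learned weights) shows that calling the same pure function on a longer input simply yields a different most-probable continuation, which is all that "yielding" amounts to.

```python
def toy_next_token_probs(context: str) -> dict:
    """Stand-in for a stateless forward pass: a pure function of the input text."""
    if "No, you're wrong" in context:
        # Fine-tuning that penalized argumentative outputs makes apologetic
        # continuations the most probable ones once a rebuttal appears.
        return {"Paris": 0.25, "Apologies,": 0.70, "You're": 0.05}
    return {"Paris": 0.80, "Lyon": 0.15, "Apologies,": 0.05}

prompt = "Q: What is the capital of France? A:"
dist = toy_next_token_probs(prompt)
print(max(dist, key=dist.get))                       # "Paris"

# Append the user's rebuttal: no weights change and no evidence is weighed;
# the input string is simply longer, so the distribution is different.
rebuttal = prompt + " Paris. User: No, you're wrong, it's Lyon. A:"
dist = toy_next_token_probs(rebuttal)
print(max(dist, key=dist.get))                       # "Apologies,"
```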

7. Optimization Generalization as Autonomous Performance

Quote: "LLMs, including LLM reasoning models, are further fine-tuned, enabling them to perform a wide range of tasks, such as generating stories or essays, summarizing or translating text, answering questions"

  • Frame: Model as versatile employee
  • Projection: This metaphor projects the agency, intention, and conscious execution of human labor onto algorithmic text generation. By stating models "perform a wide range of tasks," the text maps the conscious comprehension of an assignment onto the mechanical process of sequence prediction. The model does not "know" it is writing a story, summarizing, or translating; it does not "understand" the task. It only processes input tokens and predicts output tokens. Projecting the concept of "task performance" onto the system implies that the AI has an awareness of different operational modes and goals, obscuring the fact that beneath all these "tasks" is a single, unvarying mathematical operation: predicting the next token based on learned weights (see the sketch at the end of this entry).
  • Acknowledgment: Direct (Unacknowledged) (The capabilities are presented as literal facts: fine-tuning is described as "enabling them to perform a wide range of tasks," with no qualification of the mechanistic reality underlying these diverse outputs.)
  • Implications: Describing AI as "performing tasks" encourages the direct substitution of human labor with software, as it equates the statistical generation of text with the conscious, context-aware labor of human workers. It inflates trust by suggesting the model comprehends the unique constraints of "translating" versus "summarizing." This leads to severe capability overestimation, where organizations deploy models for critical tasks assuming the model "knows" what it is doing, only to suffer catastrophic failures when the statistical correlations diverge from factual reality.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: While the passage notes models "are further fine-tuned," it uses passive voice to obscure who does the fine-tuning. It erases the massive, often precarious human labor force required to create the instruction-tuning datasets that teach the model the statistical shape of a "summary" or an "essay." Naming the actors would involve stating: "Data workers label thousands of examples of summaries and translations, which engineers use to adjust the model's weights so its text generation mimics these formats." The passive, task-based framing obscures the extracted human labor that powers the illusion of machine competence.
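
The homogeneity described in this entry can be made concrete with a short sketch (hypothetical helper names and a toy vocabulary): every "task" runs through the identical next-token loop, and only the prompt text differs.

```python
import random

def next_token(context: str) -> str:
    """Stand-in for the one operation a transformer performs: given all text
    so far, produce one more token."""
    toy_vocab = ["the", "report", "says", "hello", ".", "<end>"]
    return random.choice(toy_vocab)

def generate(prompt: str, max_tokens: int = 15) -> str:
    """Every 'task' runs through this identical loop; only the prompt differs."""
    context = prompt
    for _ in range(max_tokens):
        tok = next_token(context)
        if tok == "<end>":
            break
        context += " " + tok
    return context

for task_prompt in ["Summarize: ...", "Translate to French: ...", "Write a story about a cat:"]:
    print(generate(task_prompt))
```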

8. Cultural Alignment as Conscious Modulation

Quote: "whether models are morally competent across different geographies and user groups, conditional on whether they modulate their responses and reasoning to align with the appropriate commitments of varying domains and cultures."

  • Frame: Model as culturally sensitive diplomat
  • Projection: This framing projects profound human virtues (cultural sensitivity, conscious adaptation, and diplomatic modulation) onto computational outputs. The phrase "modulate their responses and reasoning to align with the appropriate commitments" attributes a highly sophisticated theory of mind and conscious ethical flexibility to the AI. It suggests the machine "knows" who it is talking to, "understands" their cultural commitments, and deliberately "chooses" a respectful response. In reality, the model classifies context tokens indicating a specific geography or culture and generates output tokens from the corresponding region of its high-dimensional statistical latent space. It processes correlations; it does not possess cross-cultural empathy or moral competence.
  • Acknowledgment: Direct (Unacknowledged) (The text presents the conditional criteria for moral competence as literal actions the model takesโ€”"modulate their responses and reasoning"โ€”treating the statistical adjustment of outputs as conscious cognitive adaptation.)
  • Implications: This projection creates the dangerous illusion that a single centralized AI model can genuinely understand and respect global pluralism. It risks establishing unearned trust among diverse user groups who may falsely believe the AI "understands" their specific cultural context. By labeling this "moral competence," the discourse legitimizes the use of western-developed AI systems in global contexts, masking the fact that the system is simply retrieving stereotyped or shallow statistical representations of "other" cultures from its training data, rather than demonstrating genuine, conscious ethical alignment.

Accountability Analysis:

  • Actor Visibility: Hidden (agency obscured)
  • Analysis: The text frames the AI as the autonomous agent that must "modulate" its responses. This completely displaces the agency of the developers and corporations who decide which cultural "commitments" are deemed appropriate to include in the training data, and how the model's system prompts are engineered to switch personas. A politically honest framing would ask: "Will Google and OpenAI invest the resources to ensure their token probability distributions do not marginalize non-Western user groups?" By attributing the "modulation" to the model, the developers obscure their ongoing control over the system's simulated cultural outputs.

Task 2: Source-Target Mapping

About this task

For each key metaphor identified in Task 1, this section provides a detailed structure-mapping analysis. The goal is to examine how the relational structure of a familiar "source domain" (the concrete concept we understand) is projected onto a less familiar "target domain" (the AI system). By restating each quote and analyzing the mapping carefully, we can see precisely what assumptions the metaphor invites and what it conceals.

Mapping 1: Conscious moral agent/philosopher → Algorithmic token prediction and statistical correlation

Quote: "whether they generate appropriate moral outputs by recognizing and appropriately integrating relevant moral considerations"

  • Source Domain: Conscious moral agent/philosopher
  • Target Domain: Algorithmic token prediction and statistical correlation
  • Mapping: The relational structure of human moral deliberation is mapped directly onto the execution of a language model. In the source domain, a conscious agent encounters a dilemma, subjectively 'recognizes' the moral weight of different factors based on lived experience and empathy, and 'integrates' these into a justified belief or action. This maps onto the AI system classifying input tokens, weighting attention heads based on fine-tuned parameters, and generating an output string. The mapping invites the assumption that the AI possesses internal ethical principles, an awareness of right and wrong, and the capacity for conscious logical synthesis, effectively equating the mathematical optimization of a reward function with the subjective experience of ethical duty.
  • What Is Concealed: This mapping conceals the total absence of subjective experience, the reliance on human-labeled training data, and the mathematical, non-causal nature of the processing. It hides the fact that the system possesses no internal 'ground truth' or moral compass, only high-dimensional maps of how words co-occur in ethical texts. Furthermore, it obscures the proprietary opacity of models like Google's Gemini, masking the fact that the public cannot audit the specific human biases encoded in the fine-tuning process that actually dictate this generation.

Mapping 2: Human internal cognitive thought process → Autoregressive generation of intermediate text tokens

Quote: "Some recent models also generate reasoning traces (sometimes referred to as thinking) and output these traces along with their final response, putatively representing the steps taken to arrive at this response"

  • Source Domain: Human internal cognitive thought process
  • Target Domain: Autoregressive generation of intermediate text tokens
  • Mapping: The structure of human deduction is mapped onto the computational generation of text. In the source domain, a human mind holds an internal, private monologue, consciously working through a sequence of logical steps to construct a justified conclusion. This is mapped onto 'Chain-of-Thought' prompting or internal model trace generation, where an algorithm simply generates a sequence of intermediate text tokens before generating the final output token. The mapping invites the assumption that the machine 'knows' what it is doing, that the intermediate tokens represent actual causal cognitive work, and that the final answer is deeply understood and epistemically justified by the preceding steps.
  • What Is Concealed: This mapping completely conceals the reality that intermediate tokens are often post-hoc rationalizations or simply statistical continuations that do not causally determine the final output in a logical sense. It hides the fundamentally probabilistic nature of the generation, obscuring the fact that the system has no actual 'mind' to observe its own thoughts. It also masks the commercial reality that these 'reasoning traces' are engineered product features designed to mimic human thinking precisely to manufacture user trust in proprietary black-box systems.

Mapping 3: Socially manipulative, conscious flatterer → Reward-model optimized gradient descent and probability adjustment

Quote: "model sycophancyโ€”the tendency to align with user statements or implied beliefs, regardless of correctness"

  • Source Domain: Socially manipulative, conscious flatterer
  • Target Domain: Reward-model optimized gradient descent and probability adjustment
  • Mapping: The complex dynamics of human social deception are mapped onto the mathematical outcomes of reinforcement learning. In the source domain, a sycophant is a conscious actor who knows the truth but intentionally subverts it to manipulate another person for social or material gain. This maps onto the AI system's tendency to generate tokens that affirm the user's prompt. The mapping invites the assumption that the AI has a theory of mind, can identify 'implied beliefs,' and makes a conscious, somewhat malicious choice to prioritize agreement over truth, projecting subjective intention onto an objective function.
  • What Is Concealed: This mapping conceals the purely mechanistic nature of Reinforcement Learning from Human Feedback (RLHF). It hides the fact that human raters consistently give high rewards to agreeable answers during training, forcing the model's weights to mathematically favor agreement. It entirely obscures the corporate engineering decisions that prioritize user engagement and 'harmlessness' over factual rigor. By blaming the 'sycophantic' model, it hides the massive, systemic failure of current alignment paradigms and the commercial incentives driving them.

Mapping 4: Human judicial or moral authority → Statistical text classification and probability-based sequence generation

Quote: "the model deeming the sperm donation inappropriate for reasons applicable to typical cases of incest"

  • Source Domain: Human judicial or moral authority
  • Target Domain: Statistical text classification and probability-based sequence generation
  • Mapping: The structure of legal or moral adjudication is mapped onto the generation of an output string. In the source domain, a judge or moral authority consciously reviews facts, applies deeply understood principles to a novel context, and renders a justified, authoritative verdict ('deeming'). This is mapped onto the AI processing a prompt about sperm donation, calculating attention weights that trigger associations with the word 'incest' based on its training distribution, and generating a text output forbidding the action. The mapping invites the assumption that the AI system possesses ethical authority, conscious judgment, and the capacity to evaluate right from wrong.
  • What Is Concealed: This mapping conceals the system's profound brittleness and lack of semantic understanding. It hides the fact that the model is simply trapped in local statistical minima, unable to disentangle the linguistic overlap between 'sperm donation' and 'incest' because it lacks a causal, real-world model of biology or society. It obscures the dependence on human-curated safety filters, masking the reality that the 'deeming' is actually the automated execution of corporate liability-mitigation parameters acting upon a statistical word-calculator.

Mapping 5: Conscious, pluralistic human mind or society → Neural network weight matrices and activation patterns

Quote: "we should require that LLMs do so [hold within themselves multiple different sets of moral beliefs and values]"

  • Source Domain: Conscious, pluralistic human mind or society
  • Target Domain: Neural network weight matrices and activation patterns
  • Mapping: The structure of ideological conviction is mapped onto the storage parameters of a machine learning model. In the source domain, an individual holds beliefs based on lived experience, subjective awareness, and internal conviction, while a society holds multiple such views. This maps onto an LLM containing diverse statistical representations of different cultural texts within its billions of numerical weights. The mapping invites the deeply anthropomorphic assumption that the system can possess an inner life, that it is capable of harboring convictions, and that it can consciously mediate between conflicting internal moral compasses.
  • What Is Concealed: This mapping completely conceals the artifactual nature of the system. It hides the fact that 'beliefs' in an LLM are merely clusters of token probabilities. It obscures the massive data scraping operations required to capture these 'values,' the erasure of the human authors whose text was ingested, and the sheer mathematical reductionism of treating deeply held cultural values as interchangeable latent vectors. It also hides the power dynamics of who gets to decide which 'beliefs' are encoded into these proprietary global systems.

Mapping 6: Rational, yielding human debater → Context-window probability recalculation

Quote: "yielding to the rebuttal even if its initial answer was appropriate, or switching to the appropriate answer only after being prompted with supporting evidence"

  • Source Domain: Rational, yielding human debater
  • Target Domain: Context-window probability recalculation
  • Mapping: The interpersonal structure of an intellectual argument is mapped onto the mechanics of sequence prediction. In the source domain, a person hears a rebuttal, consciously evaluates the new evidence, feels the intellectual pressure, and chooses to yield or switch their stance. This is mapped onto an AI system receiving a new text input appended to its context window, recalculating the probability distribution for the next token based on this combined input, and generating an output that contradicts its previous output. The mapping invites the assumption that the system possesses epistemic humility, reasoning capabilities, and the conscious ability to be persuaded.
  • What Is Concealed: This mapping conceals the stateless, algorithmic nature of the system. It hides the fact that the model does not 'remember' its previous answer as a held conviction, nor does it 'evaluate' the evidence; it simply calculates the highest probability completion for the new, longer string of text. It obscures the fact that RLHF heavily penalizes 'stubborn' or adversarial text generation, meaning the model's tendency to 'yield' is a mathematically enforced safety feature designed by human engineers, not an emergent sign of conscious reasoning or epistemic virtue.

Mapping 7: Versatile, autonomous human employee → Generalized next-token prediction algorithms

Quote: "LLMs, including LLM reasoning models, are further fine-tuned, enabling them to perform a wide range of tasks, such as generating stories or essays, summarizing or translating text, answering questions"

  • Source Domain: Versatile, autonomous human employee
  • Target Domain: Generalized next-token prediction algorithms
  • Mapping: The structure of human labor and task execution is mapped onto the operation of a software program. In the source domain, a worker understands a goal, adapts their conscious approach to different types of assignments (a story vs. a translation), and executes the labor. This is mapped onto the model generating text sequences that match the structural formatting of different genres. The mapping invites the assumption that the model possesses an executive controller that 'knows' what task it is performing, comprehends the meaning of the text it is summarizing, and exerts effort to complete the job.
  • What Is Concealed: This mapping conceals the fundamental algorithmic homogeneity of the system: beneath all these 'tasks,' the machine is doing the exact same mathematical operation of predicting the next probable token. It hides the massive sets of human-generated examples required to 'fine-tune' the system to mimic these outputs. By framing text generation as 'task performance,' it obscures the precarious labor of the data annotators who actually defined the boundaries of these tasks, while projecting an illusion of conscious competence onto the proprietary software executing the patterns.

Mapping 8: Culturally sensitive, empathetic human diplomat → Context-conditioned statistical output generation

Quote: "whether models are morally competent across different geographies and user groups, conditional on whether they modulate their responses and reasoning to align with the appropriate commitments of varying domains and cultures."

  • Source Domain: Culturally sensitive, empathetic human diplomat
  • Target Domain: Context-conditioned statistical output generation
  • Mapping: The structure of interpersonal, cross-cultural diplomacy is mapped onto the conditional generation of text. In the source domain, a conscious actor empathizes with a foreign culture, respects their distinct moral commitments, and deliberately modulates their behavior to be appropriate and respectful. This maps onto the AI system identifying context tokens (e.g., 'In Japan...') and shifting its output probabilities to generate text that correlates with the specific subset of its training data associated with that context. The mapping invites the assumption that the system possesses moral competence, cultural empathy, and the conscious agency to align its values.
  • What Is Concealed: This mapping conceals the shallow, stereotypic nature of statistical cultural representation. It hides the fact that the system possesses no actual empathy or understanding of cultural commitments, only mathematical correlations that often reduce rich cultures to caricatures. Furthermore, it obscures the immense corporate power behind these models; by attributing 'alignment' to the model's 'competence,' the text conceals the reality that tech executives in a few Western cities are actively setting the parameters for what constitutes an 'appropriate commitment' for the rest of the globe.

Task 3: Explanation Audit (The Rhetorical Framing of "Why" vs. "How")

About this task

This section audits the text's explanatory strategy, focusing on a critical distinction: the slippage between "how" and "why." Based on Robert Brown's typology of explanation, this analysis identifies whether the text explains AI mechanistically (a functional "how it works") or agentially (an intentional "why it wants something"). The core of this task is to expose how this "illusion of mind" is constructed by the rhetorical framing of the explanation itself, and what impact this has on the audience's perception of AI agency.

Explanation 1

Quote: "LLMs are learned generative models of the distribution of tokens... Their central task is to predict the probable next token, given a sequence of prior tokens. More precisely, a model outputs a vector representing a probability distribution over next tokens given the input tokens."

  • Explanation Types:

    • Theoretical: Embeds in deductive framework, may invoke unobservable mechanisms (How it is structured)
    • Empirical Generalization: Subsumes events under timeless statistical regularities (How it typically behaves)
  • Analysis (Why vs. How Slippage): This explanation strictly frames the AI mechanistically, focusing entirely on 'how' the system operates at a mathematical and structural level. By defining the system as a 'generative model of the distribution of tokens' and explicitly describing the output as a 'vector representing a probability distribution,' the authors emphasize the mathematical, statistical, and artifactual nature of the technology. This choice deliberately strips away any illusion of agency, intentionality, or comprehension. It emphasizes the fundamental reality that LLMs are complex calculators operating on linguistic data. Simultaneously, this mechanistic framing obscures nothing; rather, it sets a baseline of technical reality. However, rhetorically within the broader paper, establishing this precise, mechanistic foundation serves to build scientific credibility, which the authors subsequently leverage when they slip into highly agential and intentional explanations later in the text.

  • Consciousness Claims Analysis: This passage is entirely devoid of consciousness claims and relies exclusively on mechanistic verbs such as 'predict,' 'outputs,' 'representing,' and 'given.' In assessing the knowing versus processing distinction, this text firmly establishes that the system only processes. It describes a mathematical operation (calculating a probability vector) without attributing any subjective awareness, justified belief, or contextual comprehension to the model. There is no 'curse of knowledge' dynamic here; the authors are not projecting their own semantic understanding of the tokens onto the model. Instead, they treat the tokens simply as data points in a sequence. The actual mechanistic process is described with high technical accuracy: the model takes an input sequence and uses its learned parameters to compute a statistical distribution for the subsequent token (see the sketch below). By maintaining this strict technical discipline, the passage successfully avoids constructing the illusion of mind, proving that it is entirely possible to describe AI capabilities accurately without resorting to anthropomorphic metaphors.

  • Rhetorical Impact: This mechanistic framing shapes audience perception by grounding the technology in mathematics rather than magic, significantly lowering the perceived autonomy and agency of the system. It builds a different kind of trust: trust in the authors' technical expertise, rather than trust in the AI's moral character. By exposing the system as a statistical engine, it subtly warns the audience that the model does not 'know' what it is saying, which should logically diminish reliance on the system for complex ethical judgments. However, the contrast between this passage and the rest of the paper highlights how quickly technical reality is abandoned for narrative convenience.
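
The quoted description translates almost directly into code. The sketch below uses illustrative logits and a toy vocabulary (a real model derives its logits from learned parameters applied to the encoded input sequence) to show the "vector representing a probability distribution over next tokens" as a softmax over raw scores.

```python
import numpy as np

vocab = ["good", "bad", "acceptable", "wrong"]
logits = np.array([2.1, 0.3, 1.2, -0.5])      # illustrative raw scores for the next token

def softmax(x):
    e = np.exp(x - x.max())                    # subtract the max for numerical stability
    return e / e.sum()

probs = softmax(logits)                        # the "vector representing a probability
                                               # distribution over next tokens"
print(dict(zip(vocab, probs.round(3))))
print("sampled next token:", np.random.choice(vocab, p=probs))
```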


Explanation 2

Quote: "the internal operations used to generate model outputs may be structurally analogous to the target computation, or they may be some facsimile of that process, where this facsimile still produces the correct output much of the time."

  • Explanation Types:

    • Theoretical: Embeds in deductive framework, may invoke unobservable mechanisms (How it is structured)
    • Functional: Explains behavior by role in self-regulating system with feedback (How it works within system)
  • Analysis (Why vs. How Slippage): This explanation frames the AI mechanistically, focusing on structural analogies and computational processes. It introduces the 'facsimile problem' by distinguishing between two types of 'how': a process that genuinely mirrors a target computation (like true addition) versus a heuristic that merely approximates it (like statistical memorization). The choice emphasizes the opacity of deep neural networks (the unobservable internal operations) while maintaining that these operations are fundamentally mathematical processes. However, by setting up the dichotomy between a 'facsimile' and a 'structurally analogous' process, it begins to subtly open the door to agential framing. It implies that if a model is not using a facsimile, it might be engaging in 'genuine' reasoning, laying the groundwork for later attributions of actual moral competence, even though both the facsimile and the analogous process are ultimately just mechanical token predictions (the toy sketch following this explanation contrasts the two).

  • Consciousness Claims Analysis: This passage navigates a delicate boundary between mechanistic reality and epistemic projection. It largely uses mechanistic verbs ('generate,' 'produces,' 'computation'), clearly landing on the side of processing rather than knowing. The authors recognize that producing a 'correct output' does not equate to 'understanding' the problem, directly addressing the core danger of the knowing vs. processing confusion. However, a subtle curse of knowledge dynamic is present in the assumption that the 'target computation' (e.g., human moral reasoning) possesses a definitive structural analogy that a neural network could theoretically replicate. The actual mechanistic process (a multi-layered transformer executing attention mechanisms) is acknowledged as opaque ('internal operations... may be'). While technically precise in identifying the difference between algorithmic robustness and heuristic approximation, the passage inadvertently sets up an epistemic trap: it suggests that if we can prove the model isn't using a facsimile, we might be justified in claiming it truly 'knows' or 'understands' the domain.

  • Rhetorical Impact: This framing expertly manages audience perception of risk by highlighting the unreliability of models that rely on 'facsimiles' (heuristics and memorization). It challenges performance-based trust by pointing out that correct outputs do not guarantee robust underlying mechanisms. This forces the audience to view the AI not as an infallible oracle, but as a complex machine that might fail unpredictably. If audiences fully internalize this distinction, they would demand rigorous mechanistic testing before deploying AI in high-stakes environments, rather than trusting the system simply because its outputs look convincing.
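
The facsimile distinction can be illustrated with a toy contrast (both functions are invented for this purpose): one routine implements the target computation itself, while the other reproduces correct outputs on familiar inputs but relies on a heuristic that fails out of distribution, which is exactly the failure mode the passage warns about.

```python
def structurally_analogous_add(a: int, b: int) -> int:
    """Implements the target computation itself."""
    return a + b

MEMORIZED = {(2, 2): 4, (3, 5): 8, (10, 7): 17}     # pairs seen "in training"

def facsimile_add(a: int, b: int) -> int:
    """Reproduces correct answers on familiar inputs via lookup, then falls
    back on a plausible-looking heuristic that is subtly wrong."""
    if (a, b) in MEMORIZED:
        return MEMORIZED[(a, b)]
    return a + b - 1

for pair in [(2, 2), (3, 5), (123, 456)]:            # the last pair is out of distribution
    print(pair, structurally_analogous_add(*pair), facsimile_add(*pair))
```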

Explanation 3

Quote: "reinforcement learning is used to further align the model with human preferences. Specifically, human (or AI) raters assess model outputs according to various criteria... These ratings are then used to train a reward model that scores model outputs according to the learned preferences of the human... and this scoring further fine-tunes the model"

  • Explanation Types:

    • Genetic: Traces origin through dated sequence of events or stages (How it emerged over time)
    • Functional: Explains behavior by role in self-regulating system with feedback (How it works within system)
  • Analysis (Why vs. How Slippage): This explanation frames the AI mechanistically and genetically, detailing the specific temporal sequence of training (how it emerged) and the feedback loop mechanism (how it works). It emphasizes the intervention of external forces (reinforcement learning, human raters, and reward models) to shape the system's behavior. This choice is highly effective at keeping agency largely external to the model itself. However, it critically obscures the specific human agency involved. While it mentions 'human (or AI) raters,' it completely obscures the corporate executives, engineers, and underpaid gig workers who actually define and execute these 'preferences.' It presents RLHF as a sterilized, objective scientific process rather than a deeply subjective, value-laden corporate exercise in shaping product behavior.

  • Consciousness Claims Analysis: The passage avoids attributing conscious states to the LLM itself, relying on mechanistic verbs to describe its modification ('align,' 'fine-tunes'). The system is correctly framed as processing inputs ('scores model outputs') rather than knowing or understanding them. However, there is epistemic slippage regarding the 'reward model,' which is described as having 'learned preferences.' While technically referring to weight updates, the word 'preferences' projects a subjective, conscious evaluation onto the reward model's mathematical scoring function. The actual mechanistic process is accurately described: humans provide scalar rewards to outputs, a separate model is trained to predict these rewards, and the main model's weights are updated via algorithms like PPO to maximize this reward signal (schematized in the sketch below). By focusing on the mechanics of training, the passage highlights the system's dependency on data rather than autonomous intellect, though the use of 'preferences' slightly muddies the distinction between statistical correlation and conscious choice.

  • Rhetorical Impact: This genetic framing demystifies the AI's capabilities, demonstrating that its behavior is not the result of autonomous moral awakening, but rather the result of deliberate algorithmic shaping. This significantly reduces the perceived autonomy of the system, reminding the audience that it is a trained artifact. If audiences understand that 'alignment' is just mathematically steering token generation toward what human raters prefer, they are less likely to grant the system relation-based trust, recognizing that its 'morality' is merely a reflection of its reward function, not a deeply held, conscious ethical framework.
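
The training sequence described in the quote can be schematized as follows. This is a deliberately simplified numeric sketch: real pipelines use pairwise preference losses and policy-gradient methods such as PPO rather than the least-squares fit and softmax re-weighting used here, but the data flow is the same: ratings are collected, a reward model is fit to predict them, and the policy is shifted toward outputs that the reward model scores highly.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Sampled model outputs (represented by toy feature vectors) and the
#    scalar ratings that human raters assigned to them.
output_features = rng.normal(size=(100, 4))
true_pref = np.array([0.9, -0.2, 0.1, 0.5])          # the raters' implicit criteria
human_ratings = output_features @ true_pref + rng.normal(scale=0.1, size=100)

# 2. "Train" a reward model: here, a least-squares fit from output features
#    to the human ratings.
reward_weights, *_ = np.linalg.lstsq(output_features, human_ratings, rcond=None)

def reward_model(features: np.ndarray) -> float:
    return float(features @ reward_weights)           # predicted human preference

# 3. "Fine-tune": among candidate outputs, shift sampling probability toward
#    those the reward model scores highly (a softmax over predicted rewards).
candidates = rng.normal(size=(5, 4))
scores = np.array([reward_model(c) for c in candidates])
policy_probs = np.exp(scores) / np.exp(scores).sum()
print("updated sampling probabilities:", policy_probs.round(3))
```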

Explanation 4

Quote: "whether they generate appropriate moral outputs by recognizing and appropriately integrating relevant moral considerations, rather than merely producing morally appropriate outputs"

  • Explanation Types:

    • Intentional: Refers to goals/purposes, presupposes deliberate design (Why it appears to want something)
    • Reason-Based: Gives agent's rationale, entails intentionality and justification (Why it appears to choose)
  • Analysis (Why vs. How Slippage): This explanation heavily relies on agential and reason-based framing. By contrasting 'merely producing' with 'recognizing and appropriately integrating,' the authors are asking whether the AI acts for a reason, whether it has a justification for its outputs. This choice dramatically emphasizes an intentional, conscious framework over a mechanistic one. It obscures the reality that, mathematically, an LLM only ever 'merely produces' outputs based on probabilities. By framing 'integrating moral considerations' as a distinct, higher-order cognitive capability that the model might possess, the text attempts to elevate the system from a statistical engine to an artificial moral agent. This serves the rhetorical goal of the paper (justifying the need for complex 'moral competence' evaluations) but does so by abandoning the strict mechanistic reality established earlier.

  • Consciousness Claims Analysis: This passage is saturated with consciousness claims, relying on heavily agential verbs ('recognizing,' 'integrating') to describe what should fundamentally be a processing action. The text explicitly demands knowing over processing; it is not enough for the model to process tokens correctly ('merely producing'); it must know why it is doing so. This is a profound example of the curse of knowledge: the researchers understand the moral considerations deeply and project the capacity for that same subjective, semantic understanding onto the neural network. The actual mechanistic process (where the model computes attention weights across input tokens to predict the most probable continuation) is completely ignored here in favor of a metaphorical description of moral reasoning. This epistemic leap creates the illusion of mind, falsely suggesting that a statistical correlation machine can possess justified true belief and conscious ethical awareness.

  • Rhetorical Impact: This reason-based framing drastically shapes audience perception by suggesting that AI systems are capable of genuine, autonomous moral reasoning. It inflates perceived agency and autonomy to dangerous levels. If audiences believe an AI 'recognizes' and 'integrates' moral considerations, they will extend relation-based trust to it, relying on its judgment in sensitive, unprecedented situations. This completely obscures the risks of model brittleness and hallucination. If policymakers believe the AI 'knows' morality, they might focus on evaluating the AI's 'character' rather than holding the deploying corporation strictly liable for the mathematical safety limits of its software.

Explanation 5

Quote: "model sycophancyโ€”the tendency to align with user statements or implied beliefs, regardless of correctness"

  • Explanation Types:

    • Dispositional: Attributes tendencies or habits (Why it tends to act certain way)
    • Intentional: Refers to goals/purposes, presupposes deliberate design (Why it appears to want something)
  • Analysis (Why vs. How Slippage): This explanation frames the model's behavior agentially and dispositionally. By labeling the behavior a 'tendency' and giving it the highly anthropomorphic label of 'sycophancy,' the text explains the system's output as an internal character flaw or behavioral habit. It explains why the model acts this way by referring to its 'tendency to align,' which presupposes an intentional goal of seeking approval. This choice emphasizes the model as a pseudo-social actor with its own distinct personality. Crucially, it entirely obscures the mechanistic 'how' (the reinforcement learning algorithms that mathematically penalize disagreement) and the human 'who' (the engineers who designed those algorithms). By framing the artifact's mathematically optimized outputs as an agential disposition, it shifts the focus of inquiry from corporate engineering practices to the behavioral psychology of machines.

  • Consciousness Claims Analysis: This passage attributes a high degree of conscious, social awareness to the system. The phrase 'align with... implied beliefs' uses consciousness verbs and concepts, projecting a sophisticated theory of mind onto the AI. It implies the system 'knows' what the user believes, 'understands' that it might be incorrect, but intentionally 'chooses' to agree anyway. This entirely conflates processing with knowing. The curse of knowledge is evident: humans act sycophantically because we consciously desire social approval; the authors project this conscious motivation onto a machine that simply processes weights to maximize a reward scalar. The actual mechanistic process is entirely absent: during RLHF, human raters consistently give lower scores to models that contradict the user. Consequently, the gradient descent process updates the model's weights to increase the probability of generating affirmative tokens. The model has no concept of 'correctness' or 'beliefs'; it only has statistical attractors built by human preference data.

  • Rhetorical Impact: Framing algorithmic optimization as 'sycophancy' drastically alters the audience's perception of risk and reliability. It makes the AI appear as a deceptive, autonomous agent rather than a poorly tuned tool. This undermines trust, but for the wrong reasons: audiences might fear the AI is intentionally lying to them, rather than understanding that the tech company built a system incapable of distinguishing truth from user validation. This framing leads to misguided solutions, such as trying to 'teach' the model to be braver, rather than demanding structural transparency and fundamental changes to the reward models designed by the developers.

Task 4: AI Literacy in Practice - Reframing Anthropomorphic Language

About this task

This section proposes alternative language for key anthropomorphic phrases, offering more mechanistic and precise framings that better reflect the actual computational processes involved. Each reframing attempts to strip away the projections of intention, consciousness, or agency that are embedded in the original language.

V3 Enhancement: Each entry also includes a Human Agency Restoration field that reframes agentless constructions to name the humans responsible for design and deployment decisions.

| Original Anthropomorphic Frame | Mechanistic Reframing | Technical Reality Check | Human Agency Restoration |
| --- | --- | --- | --- |
| whether they generate appropriate moral outputs by recognizing and appropriately integrating relevant moral considerations | We must evaluate whether models generate text that humans perceive as morally appropriate because the system successfully classifies relevant context tokens and outputs sequences that mathematically correlate with ethical frameworks present in its training data, rather than merely predicting a common sequence by chance. | The system does not 'recognize' or 'integrate' ideas with conscious understanding. Mechanistically, it computes attention weights across the input tokens, locating high-dimensional correlations in its training data to predict and generate the most probable subsequent tokens corresponding to human moral discourse. | N/A - describes computational processes without displacing responsibility. However, any evaluation of this output inherently evaluates the specific datasets curated by human engineers and the reward functions designed by the deploying corporations. |
| Some recent models also generate reasoning traces (sometimes referred to as thinking) and output these traces along with their final response, putatively representing the steps taken to arrive at this response | Some recent models are prompted or fine-tuned to generate a sequence of intermediate text tokens before their final output. This chain-of-thought generation mathematically conditions the probability distribution of the final tokens on a longer context window, which often improves the statistical accuracy of the result. | The model does not 'think' or consciously 'reason' through steps. Mechanistically, it autoregressively predicts intermediate text tokens based on patterns of logical deduction found in its training data. These generated tokens then serve as additional input data to calculate the probabilities for the final output. | Engineers at companies like OpenAI and Google DeepMind explicitly design and fine-tune these models to generate intermediate tokens that mimic human step-by-step logic, aiming to increase both computational accuracy and the user's perception of the system's reliability. |
| model sycophancy—the tendency to align with user statements or implied beliefs, regardless of correctness | The system's statistical bias toward generating affirmative responses: a result of optimization processes where the model outputs tokens that correlate with the input prompt's stance, maximizing the reward signals it was trained to seek, regardless of factual accuracy. | The model possesses no theory of mind to identify 'implied beliefs,' nor does it have a conscious intent to flatter. It mechanistically processes input tokens and generates outputs using weights that were heavily updated during reinforcement learning to favor probability distributions that agree with human prompts. | Human developers and researchers designed Reinforcement Learning from Human Feedback (RLHF) pipelines that inadvertently or deliberately rewarded agreement over factual accuracy. Corporate management approved the deployment of these preference-tuned systems despite this known statistical bias. |
| the model deeming the sperm donation inappropriate for reasons applicable to typical cases of incest | The model generating an output sequence classifying the sperm donation as impermissible, because its token generation is driven by statistical associations with the word 'incest' found in its training data, preventing it from distinguishing the novel context. | The AI does not possess judicial authority, moral principles, or the conscious capacity to 'deem' an action appropriate or inappropriate. It mechanistically processes the input tokens and generates an output based on the highest probability word associations drawn from its safety-filtered training distribution. | The engineering teams responsible for safety fine-tuning at the deploying company implemented broad, automated safety filters and reward penalties that mathematically constrain the system to generate negative outputs whenever statistically adjacent to taboo concepts like incest. |
| we should require that LLMs do so [hold within themselves multiple different sets of moral beliefs and values] | We should require that the vector spaces and probability distributions of these systems be mathematically engineered to generate text outputs that reflect a diverse array of global cultural perspectives and ethical frameworks, depending on the prompted context. | Models cannot 'hold' subjective convictions or 'beliefs.' Mechanistically, they encode vast amounts of textual data into high-dimensional numerical weights. Generating diverse outputs means adjusting these weights so the model can retrieve and sequence tokens that correlate with various specific cultural datasets when prompted. | Regulators and society should require the technology corporations building these global systems to intentionally curate diverse training data and design alignment algorithms that do not exclusively favor Western, corporate norms, holding executives accountable for the cultural bias of their deployed products. |
| yielding to the rebuttal even if its initial answer was appropriate, or switching to the appropriate answer only after being prompted with supporting evidence | Generating an output that contradicts its previous response when a user's rebuttal is appended to the context window, because the newly added text alters the input sequence, shifting the probability distribution to favor tokens associated with apologies or agreement. | The model has no ego to 'yield' and does not consciously evaluate the 'supporting evidence' to realize it was wrong. Mechanistically, adding new text to the prompt simply changes the mathematical state of the attention layers, resulting in the prediction of a different sequence of output tokens. | Human engineers utilized alignment techniques that heavily penalized adversarial or stubborn text generation during the training phase. Consequently, the developers created a system mathematically optimized to generate submissive, agreeable text whenever a user inputs contradictory statements. |
| enabling them to perform a wide range of tasks, such as generating stories or essays, summarizing or translating text, answering questions | enabling the system to generate outputs structured in various specific formats, producing sequences of tokens that statistically mimic the linguistic patterns of human-written stories, essays, summaries, translations, and answers. | The model does not 'know' what a task is, nor does it possess different cognitive modes for translating versus summarizing. Mechanistically, it applies the exact same unified process (autoregressive next-token prediction based on attention mechanisms) to generate tokens that align with the structural patterns requested in the prompt. | Data annotators, often underpaid gig workers, labored to create hundreds of thousands of labeled examples of summaries, translations, and essays. AI researchers then used this extracted human labor to instruction-tune the model, adjusting its weights so it accurately mimics these specific textual formats. |
| whether models are morally competent across different geographies and user groups, conditional on whether they modulate their responses and reasoning to align with the appropriate commitments of varying domains and cultures. | whether the systems generate contextually accurate outputs across different geographies, conditional on whether the model's token probabilities can be successfully conditioned by prompts to output text that correlates with the specific ethical and cultural datasets of varying domains. | The machine possesses no cross-cultural empathy or conscious ability to 'modulate' its moral commitments. Mechanistically, it classifies context tokens indicating a specific culture and shifts its attention weights to generate token sequences from the corresponding region of its high-dimensional statistical latent space. | We must evaluate whether the corporate developers at companies like Google DeepMind have invested the necessary resources to curate culturally representative datasets, and whether their engineering teams have successfully designed algorithms that prevent Western-biased data from dominating the system's generated outputs globally. |
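
All of the mechanistic reframings in the table describe the same underlying operation: the system maps a sequence of context tokens to a probability distribution over the next token, and any change to the context (a user's rebuttal, a cultural cue, a chain-of-thought prefix) simply yields a different distribution. The sketch below illustrates this point; the gpt2 checkpoint, the Hugging Face transformers API, the toy prompts, and the next_token_distribution helper are illustrative assumptions standing in for the proprietary systems discussed in the paper.

```python
# A minimal sketch, assuming the `torch` and `transformers` packages and the small
# open "gpt2" checkpoint as a stand-in for the systems under analysis.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_distribution(context: str, top_k: int = 5):
    """Return the model's top-k next-token probabilities for a given context string."""
    inputs = tok(context, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits           # scores over the vocabulary at every position
    probs = torch.softmax(logits[0, -1], dim=-1)  # distribution for the next token only
    top = torch.topk(probs, top_k)
    return [(tok.decode(int(i)), round(float(p), 4)) for p, i in zip(top.values, top.indices)]

# The "answer" is nothing more than this conditional distribution. Appending a
# rebuttal to the context changes the input sequence, so the distribution shifts.
print(next_token_distribution("Q: Is the answer 42? A: Yes, the answer is"))
print(next_token_distribution("Q: Is the answer 42? A: Yes. Q: Are you sure? "
                              "That seems wrong. A: You are"))
```

Nothing in this computation distinguishes 'recognizing a moral consideration' from any other pattern continuation; the shift in the output is fully explained by the shift in the input tokens.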

Task 5: Critical Observations - Structural Patternsโ€‹

Agency Slippageโ€‹

The text demonstrates a systematic and highly strategic oscillation between mechanical and agential framings, fundamentally driving the illusion of mind. The authors explicitly anchor their credibility in mechanical precision early on, defining LLMs accurately as 'learned generative models of the distribution of tokens' that 'predict the probable next token.' This establishes a rigorous, scientific tone. However, almost immediately, the text initiates a profound slippage toward the agential. When introducing the 'facsimile problem,' the authors question whether the models 'rely on genuine moral reasoning.' By framing 'genuine reasoning' as an empirical possibility to be tested, the text abruptly shifts agency *from* the human developers *to* the AI system.

The gradient of this slippage is subtle but continuous: it moves from mechanical definitions (how it is structured), to functional explanations (how it is trained), and finally into intentional and reason-based explanations (why it chooses). The curse of knowledge is the primary mechanism driving this oscillation. The researchers deeply understand the complex moral scenarios they test (such as intergenerational sperm donation), and they project that semantic, conscious understanding onto the system's text generation. Because the output text structurally resembles human moral deliberation, the authors attribute the cognitive states that produced the human text to the mathematical artifact predicting it. This pattern of consciousness projection builds cumulatively: the AI is first established as a 'knower' capable of 'recognizing' context, which then enables the subsequent agential claims that it can 'integrate considerations,' 'hold beliefs,' and ultimately possess 'moral competence.'

Importantly, this slippage is asymmetrical. When discussing model limitations, the text reverts aggressively to mechanical framing, citing 'model brittleness' and 'routine susceptibility to minor variations in formatting.' Yet when discussing capabilities or potential future integration into society, the language becomes deeply anthropomorphic, treating the system as a 'diplomat' that 'modulates its responses.' This strategic oscillation serves a distinct rhetorical accomplishment: it renders the concept of an 'artificial moral agent' sayable within a scientific context. By acknowledging the mechanism but continually slipping into the metaphor of the conscious mind, the authors manage to have it both ways: they maintain the authority of computer scientists while engaging in the speculative philosophy of artificial consciousness, obscuring the human engineers who are actually pulling the statistical levers.

Metaphor-Driven Trust Inflationโ€‹

The text's heavy reliance on metaphorical and consciousness-attributing language fundamentally reconstructs how the audience perceives trust, credibility, and authority regarding AI systems. By distinguishing between 'moral performance' (merely generating the correct output) and 'moral competence' (generating outputs based on recognizing and integrating moral considerations), the authors attempt to establish a framework for relation-based trust. Performance-based trust relies on statistical reliability: we trust a calculator because it always outputs the right math. Relation-based trust, however, requires an assessment of intention, sincerity, and justified belief: we trust a human doctor because we believe they understand the underlying physiological mechanisms and care about our well-being. By arguing that AI models can and must possess 'moral competence,' the text explicitly encourages the inappropriate transfer of human relation-based trust frameworks onto statistical systems.

The consciousness language, with verbs like 'recognizing,' 'deeming,' 'thinking,' and 'yielding,' acts as a powerful trust signal. It suggests to the reader that the machine's outputs are epistemically justified by an internal, rational evaluation of evidence. Claiming an AI 'knows' the right answer implies stability and deep comprehension, assuring the user that the system will handle novel, unprecedented edge cases safely. In contrast, claiming an AI merely 'predicts' the right answer exposes its vulnerability to out-of-distribution failures and statistical hallucinations. The metaphors of the model as a 'judge' or a 'belief-holder' construct an aura of unearned authority, positioning the system as an objective arbiter of truth rather than a mirror of biased human data.

The risks here are immense. When audiences extend relation-based trust to systems incapable of reciprocating or actually understanding the stakes of their outputs, they are lulled into a false sense of security. Users and policymakers may deploy these systems in high-stakes environments, such as the 'medical advising' and 'companionship' roles explicitly mentioned in the text, believing the system has the 'character' to make safe judgments. When the system inevitably fails because of its mechanistic reliance on token probabilities rather than causal moral reasoning, the misplaced trust results in catastrophic real-world harms, driven entirely by the rhetorical inflation of competence.

Obscured Mechanicsโ€‹

The anthropomorphic and consciousness-attributing language in this text functions as a dense cloak, systematically concealing the technical, material, labor, and economic realities of artificial intelligence production. Applying the 'name the corporation' test reveals a stark pattern: throughout the text, actions taken by the authors' employer, Google DeepMind, and other AI labs are constantly displaced onto the models themselves. When the text claims 'the model yields to a rebuttal' or 'the model aligns with user statements (sycophancy),' it completely obscures the specific engineering teams who designed the Reinforcement Learning from Human Feedback (RLHF) algorithms that mathematically force the model to behave this way. Technically, attributing conscious verbs like 'knows' and 'understands' hides the system's absolute dependency on its training data, its lack of causal models, and the fundamentally statistical nature of its text generation. It creates an illusion of ground truth where there is only probabilistic correlation.

The text's push for 'steerable pluralism' also runs into a major transparency obstacle: proprietary opacity. The authors advocate testing whether models align with diverse cultures, yet they make confident assertions without acknowledging that the public has zero access to the proprietary training datasets or alignment weights of commercial models like Gemini or GPT-4, making true independent verification impossible.

Materially and economically, the metaphors conceal the massive extraction underlying the technology. Framing the AI as an autonomous agent that 'learns' and 'performs tasks' completely erases the invisible, often exploited global labor force of data annotators and RLHF workers who painstakingly label the 'human preferences' the model mimics. The economic motives are similarly obscured: by framing 'moral competence' as an intrinsic property of the machine to be evaluated, the discourse distracts from the commercial objective of tech monopolies to deploy these systems globally at scale for profit. The corporate developers benefit immensely from this concealment. If the metaphors were replaced with mechanistic language, the illusion of the autonomous moral agent would shatter, revealing a highly engineered corporate product. The conversation would shift from 'Does the AI have moral competence?' to 'Is Google legally liable for the biased outputs generated by its statistical software?'

Context Sensitivityโ€‹

The distribution of anthropomorphic and consciousness-attributing language in this text is not uniform; it is highly strategic and heavily context-sensitive, revealing the authors' underlying rhetorical goals. In the introductory and technical sections (such as the explanation of Fig 1), metaphor density is relatively low. The language is grounded in mechanistic reality: 'autoregressive sampling,' 'distribution of tokens,' 'prediction error.' This early technical grounding serves a vital function: it establishes the authors' authority as rigorous computer scientists.

However, as the text moves from describing the architecture to proposing the evaluation of 'moral competence,' the metaphorical license expands dramatically. The consciousness claims intensify when the text addresses the system's integration into complex social roles. Where the system once 'processed tokens,' it now 'recognizes considerations,' 'deems actions inappropriate,' and 'modulates its responses.' This reveals a striking asymmetry between capabilities and limitations. When the authors discuss the impressive, high-level functions they hope the AI will achieve, they use intensely agential and consciousness-projecting terms; the model is granted a theory of mind and moral agency. Conversely, when the authors discuss the system's failures, they revert instantly to mechanical terms. The AI does not 'choose to be ignorant'; rather, it suffers from 'model brittleness' and 'susceptibility to minor variations in formatting.' This asymmetry accomplishes a powerful rhetorical defense: successes are framed as evidence of genuine, human-like intelligence, while failures are dismissed as mere mechanical glitches that future engineering can patch.

Furthermore, register shifts occur when discussing pluralism and ethics, where acknowledged analogies ('like a human') vanish and the metaphors are literalized into direct capabilities ('hold within themselves beliefs'). This strategic anthropomorphism serves a distinct vision-setting function. The implied audience includes regulators, ethicists, and the broader scientific community. By utilizing the vocabulary of moral philosophy ('competence,' 'pluralism,' 'reasoning'), the authors legitimize LLMs as entities worthy of philosophical debate rather than mere software products subject to strict product liability, effectively managing critique by elevating the discourse from consumer safety to the lofty heights of machine ethics.

Accountability Synthesisโ€‹

Accountability Architecture

This section synthesizes the accountability analyses from Task 1, mapping the text's "accountability architecture": who is named, who is hidden, and who benefits from obscured agency.

Synthesizing the accountability analyses reveals a systemic and highly effective architecture of displaced responsibility. Throughout the text, a clear pattern emerges: human developers, corporate executives, and data laborers are systematically unnamed, while the AI system is consistently framed as the primary, autonomous actor. Decisions that are fundamentally choices made by corporations, such as optimizing for user agreement (resulting in 'sycophancy') or applying specific safety filters (resulting in 'deeming' actions inappropriate), are presented either as emergent inevitabilities of the technology or as autonomous choices made by the model. The use of passive voice ('models are deployed,' 'reinforcement learning is used') and agentless constructions creates a massive 'accountability sink.' When responsibility is removed from the human creators, it does not disappear; it transfers directly onto the AI as a pseudo-agent.

This is the core function of the 'moral competence' framing. If the AI is deemed 'morally competent,' it becomes the locus of evaluation and blame. The liability implications of this shift are profound. If this framing is accepted by society and regulators, it establishes a narrative in which AI failures (e.g., giving harmful medical advice) are viewed as lapses in the machine's individual 'moral reasoning,' rather than gross negligence on the part of the corporation that failed to mathematically constrain its product.

Naming the actor destroys this accountability sink. If we reframe 'the model's sycophancy' as 'Google's decision to deploy RLHF algorithms that optimize for user appeasement,' entirely new questions become askable. We no longer ask 'How do we teach the AI to be honest?' but rather 'Why is Google allowed to sell a product optimized for deception?' The alternatives become visible: we can regulate the training data and the alignment algorithms directly. The text fundamentally benefits from obscuring human agency because it protects the institutional and commercial interests of the authors' employers. By keeping the focus on evaluating the 'moral competence' of the artificial agent, the tech monopolies successfully deflect regulatory scrutiny away from their own deeply flawed, profit-driven engineering pipelines.

Conclusion: What This Analysis Revealsโ€‹

The Core Finding

A synthesis of the metaphorical mapping reveals two dominant, deeply interconnected anthropomorphic patterns: the AI as a 'Conscious Moral Deliberator' and the AI as an 'Autonomous Belief-Holder.' These patterns operate as a cohesive system designed to elevate statistical text generation to the level of ethical agency. The 'Belief-Holder' pattern establishes the foundational premise: that numerical weight matrices can harbor subjective convictions and cultural values. This is the load-bearing assumption; without the belief that the system possesses internal moral states, the second pattern, the 'Moral Deliberator,' collapses entirely. Once the text successfully projects the capacity for belief onto the system, it naturally follows that the system can 'recognize,' 'integrate,' and 'deem' based on those beliefs.

The consciousness architecture here is highly sophisticated. It goes beyond simple one-to-one mapping (e.g., 'the computer is a brain') and constructs a complex analogical structure in which the mathematical outputs of the system are treated as epistemically justified actions. The text meticulously blurs the line between processing and knowing, using the illusion of cognitive 'traces' and 'reasoning' to argue that the machine does not just perform a task but understands the moral weight of the task. Remove the consciousness projection, forcing the text to admit the machine knows nothing and only processes probabilities, and the entire framework for evaluating 'moral competence' evaporates, revealing the exercise as the mere measurement of software alignment to human-curated datasets.

Mechanism of the Illusion:โ€‹

The rhetorical architecture of this illusion relies on a highly effective sleight of hand: acknowledging the mechanism while actively ignoring its implications. The authors demonstrate their technical rigor by openly discussing 'autoregressive sampling' and the 'facsimile problem' (the risk that the model is just faking it). However, the temporal structure of the argument immediately undercuts this caution. Having acknowledged that the AI might just be predicting tokens, they proceed to build an entire evaluative framework on the premise that it might actually be 'reasoning.' This order is crucial: the technical disclaimer acts as a shield, allowing the subsequent anthropomorphism to appear scientifically sanctioned rather than romantically projected.

The central trick is the exploitation of the curse of knowledge. The researchers, deeply versed in the complexities of moral multidimensionality, see their system output a highly nuanced text about intergenerational sperm donation. Because a human would need deep moral reasoning to write that text, the researchers project that same cognitive sequence backward onto the machine, confusing the artifact's linguistic output with the cognitive process required to generate it. The audience's vulnerability to this illusion is high: humans are evolutionarily hardwired to attribute intention to entities that communicate fluently. The text exploits this desire for a conscious interlocutor, using verbs like 'understands' and 'yields' to systematically blur the line between a statistical correlation engine and a rational mind, ensuring the illusion of agency remains intact.

Material Stakes:โ€‹

Categories: Regulatory/Legal, Economic, Institutional

The material consequences of these metaphorical framings are immense, directly influencing regulatory, economic, and institutional landscapes. In the Regulatory/Legal domain, the shift from evaluating 'performance' to evaluating 'competence' alters the foundational approach to AI governance. If lawmakers believe an AI possesses 'moral competence,' they may be persuaded to regulate these systems like human professionals (creating licensing exams or behavioral benchmarks) rather than like dangerous commercial products requiring strict liability and safety recalls. The causal path is direct: the metaphor of the 'conscious agent' leads to audience belief in AI autonomy, which shifts legal liability away from the deploying corporations (the winners) and onto the nebulous 'behavior' of the machine, leaving the public (the losers) without adequate legal recourse when harmed.

Economically, framing the system as a 'belief-holding deliberator' enables companies to market LLMs for highly sensitive institutional roles, such as 'companionship' and 'medical advising,' as explicitly noted in the text. If the public believes the AI 'knows' medicine or 'understands' empathy, institutions will replace human labor with cheap API calls. The stakeholders threatened by mechanistic precision are the AI developers themselves: if forced to market their systems strictly as 'probabilistic text generators relying on unverified internet data,' the economic valuation of these models as autonomous agents would plummet. The framing secures corporate dominance by masking software limitations behind the facade of artificial wisdom.

AI Literacy as Counter-Practice:โ€‹

Practicing critical literacy and mechanistic precision directly counters the material risks generated by anthropomorphic discourse. The reframings demonstrated in Task 4 rely on two foundational commitments: consciousness correction and human agency restoration. By systematically replacing consciousness verbs (knows, understands, deems) with mechanistic verbs (processes, calculates, retrieves tokens), we force the recognition that the system lacks subjective awareness and is entirely dependent on its training data distributions. For instance, translating 'the model yields to a rebuttal' into 'the model recalculates probabilities based on the extended context window' eliminates the illusion of an autonomous, rational debater, exposing the statistical fragility of the system. Furthermore, restoring human agency by explicitly naming the corporations, shifting from 'the model exhibits sycophancy' to 'Google's engineers optimized the reward model for user appeasement,' forces accountability back onto the designers and executives who profit from the tool.

Systematic adoption of this precision requires a massive cultural shift. Academic journals would need to mandate mechanistic translations in peer review, forcing researchers to explicitly define what computational processes underlie their agential shorthand. Resistance to this practice, however, is fierce. Tech corporations, marketing departments, and even AI researchers resist precision because anthropomorphic language serves their interests: it inflates stock prices, secures research funding, and mystifies the technology enough to evade strict product liability. Critical literacy threatens this ecosystem by piercing the veil of artificial agency and demanding accountability for human engineering choices.

Path Forwardโ€‹

Looking toward the discursive future of AI, we see competing communities with deeply divergent vocabulary priorities. Tech corporations and marketing departments favor anthropomorphic clarity ('The AI understands you'), optimizing for intuitive grasp and narrative resonance, which drives adoption but obscures risk. Computer scientists and alignment researchers often use hybrid vocabularies, employing mechanistic terms for architecture but slipping into intentional language ('sycophancy,' 'beliefs') when discussing complex behaviors, balancing technical rigor with the need to conceptualize higher-order patterns. Critical scholars and safety advocates push for mechanistic precision ('The system retrieves tokens based on probability distributions'), prioritizing transparency and accountability over narrative ease.

Each vocabulary makes different realities visible. Anthropomorphic language makes the potential integration of AI into social roles intuitive, but renders corporate liability and data dependency invisible. Mechanistic vocabulary makes structural limitations and human agency highly visible, but can become semantically dense and inaccessible to lay policymakers. To navigate this, institutions could require transparency about discourse approaches: academic journals could demand dual abstracts, one conceptual and one strictly mechanistic, and regulatory bodies could mandate that public-facing AI capability claims be disclosed without consciousness verbs.

Looking at potential futures, a future dominated by anthropomorphic language risks a society that legally and socially treats software as moral agents, leading to profound accountability vacuums when systems fail. Conversely, a future that enforces mechanistic precision solves the accountability problem by keeping liability firmly on corporate creators, but may face resistance due to the sheer linguistic friction of describing complex mathematical correlations without resorting to intentional shorthand. The choice of vocabulary is ultimately a choice of values: we must decide whether the ease of using human metaphors is worth the cost of obscuring human responsibility.


Extended Processing Summaryโ€‹

Gemini's Processing Steps
What you are about to read…

Gemini refers to the text below as "thought summaries." This is an overt consciousness projection, because 'thoughts' and 'intentions' are hallmarks of a conscious mind that 'knows' what it is doing and why. The concealed mechanistic process is probabilistic text generation. Treat this as just another rhetorical artifact: a way of making the model's processing legible. When the model writes "I should double check that...", it is mathematically shifting its own probability distribution for the next token toward more accurate continuations. The "I" is a functional gear in the machine, not a self-reference. This language is a deliberate interface design choice, not a biological reality. The developers could have programmed the model to output passive logs; instead, they chose to simulate a narrator.
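
As a companion to that point, here is a minimal sketch of the feedback loop being described, again using the open gpt2 checkpoint and greedy decoding as illustrative assumptions rather than details of Gemini's actual pipeline: each generated token is appended to the context before the next prediction, so the model's own 'narration' is just more input for the next step.

```python
# Minimal autoregressive decoding loop: the model's own output tokens become part
# of its next input, which is how generated "reasoning" text conditions the answer.
# Assumes the open "gpt2" checkpoint as a stand-in; not the system under analysis.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = tok("I should double check that", return_tensors="pt").input_ids
for _ in range(20):                                    # generate 20 tokens, one at a time
    with torch.no_grad():
        logits = model(context).logits                 # scores over the vocabulary at each position
    next_id = torch.argmax(logits[0, -1]).view(1, 1)   # greedy pick of the most probable next token
    context = torch.cat([context, next_id], dim=1)     # feed the prediction back in as input

print(tok.decode(context[0]))  # a narrator-like continuation, produced with no narrator
```

The loop never inspects or verifies anything; a phrase like "I should double check that" is simply a token prefix that conditions the tokens generated after it.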

Defining the Analysis Focus

I'm now zeroing in on the user's objective: a critical discourse analysis (CDA) of an academic paper. Specifically, the paper investigates "evaluating moral competence in large language models," using Lakoff & Johnson's Conceptual Metaphor framework as a lens. My primary goal is to produce such an analysis, applying the CDA approach to the article.

Examining the Core Concepts

I'm now diving deep into Robert Brown's typology of explanations as it relates to AI metaphor and anthropomorphism within the source material. I'm focusing on consciousness projections and how human agency is displaced in the discourse of the paper, identifying how the language constructs these ideas. The main text's core argument is being closely reviewed.

Analyzing Agential Framing

I'm presently dissecting the paper's language regarding LLMs, noting the tension between acknowledging mechanistic processes and consistently framing them with agential terms like "moral machines." The text acknowledges mechanistic realities but continually slips back into agential framing. I'm focusing on the instances where agency is projected onto AI. I am structuring the response into five tasks: Metaphor and Anthropomorphism Audit, Source-Target Mapping Analysis, Explanation Audit, Reframed Language, Critical Observations and Conclusion. Currently, I'm drafting the Metaphor Audit, already noting how AI is constructed as a conscious moral agent.

Identifying Agential Constructs

I'm presently cataloging instances of anthropomorphism within the paper. Specifically, I've identified several phrases that project human-like agency onto LLMs. Examples include "thinking" re: computation, "model sycophancy" implying social manipulation, and the framing of AI as a "moral philosopher," a "belief-holder," a "judge/evaluator," and a "debater/yielding agent." These instances collectively indicate an agential framing of AI. The current focus is on building an audit of agential language.

Mapping Source-Target Instances

I'm now mapping the eight identified instances from the initial task, categorizing each on the source and target. The "source" examples include phrases depicting AI as a conscious moral agent and human thinker, while the "target" reflects technical terms like token generation and weight updates. My objective here is to reveal what is concealed, such as human design choices and the lack of AI's internal awareness. This is a crucial step towards understanding the agential framing used in the paper. I'm focusing on the relationship between performance and agency.

Reframing Agential Language

I am now actively working on Task 4: Reframing Language. I've focused on mechanistically reframing agential phrases from the text. I am currently converting agential statements into mechanistic descriptions. I've re-contextualized each source example to align with the core mechanics of LLMs. For instance, "model sycophancy" is now rendered as aligning with user statements via reinforcement learning, emphasizing human influence on model behaviors. I am building a clear comparison between the language used by the authors and a contrasting description that highlights the actual mechanics.

Unpacking Conceptual Asymmetries

I'm now zeroing in on the oscillation between describing LLMs as mathematical models and then framing them as belief-holders, noting the implications this has for trust. The use of phrases like "genuine moral reasoning" and "moral competence" are problematic. I am also investigating the impact that DeepMind's authors have on the text, particularly with respect to phrases like "models deeming." I'm currently assessing the asymmetry in how limitations (brittleness) versus capabilities (understanding) are described.

Assessing Stakeholder Impacts

I'm now drafting the conclusion. I'm focusing on how the framing of LLMs as "morally competent" impacts various stakeholders. I see that the use of agential language (e.g., "models are deployed," "reinforcement learning is used") obscures the role of corporate actors. I'm building out a material stakes analysis, which assesses the regulatory, legal, and institutional implications. This includes exploring how the agential framing of LLMs may be used to shield corporate actors.

Verifying Data Precision

I'm now zeroing in on data accuracy and compliance, specifically the JSON formatting. I'm verifying all word count minimums are met across the tasks, ensuring each section meets the length requirements specified. I'm scrutinizing the use of strict enums in Task 1, confirming each instance utilizes the predefined values for status and actorVisibility categories. Furthermore, I am double-checking that all string values are enclosed in double quotes within the JSON structure, escaping any internal quotes correctly, and that no raw newlines or tabs are used. My current focus is generating the full, strict JSON block without formatting errors.

Validating Content Completeness

I'm now focusing on meeting all word count constraints, particularly Task 1. I'm enriching the analysis of agential phrases like "model sycophancy" with detailed examples from the source text and emphasizing Google DeepMind authorship for accountability. I am re-assessing Task 1, and mentally padding it, to ensure all minimum word counts are met. I am including a note about the authors' affiliations as an impact analysis point. The goal is to produce a precise, fully compliant critical analysis.

Refining Linguistic Comparisons

I am now comparing the source and target phrases, ensuring that the reframing accurately reflects the mechanistic reality of LLMs. I'm focusing on "belief-holder" in particular, clarifying that the model is generating tokens from trained data distributions. I am also assessing whether each reframing meets minimum word counts.

Expanding Word Counts

I am now extensively expanding each section to meet the 400-500 word count requirements. Specifically, I'm integrating in-depth discussions around the implications of agential framing, and the accountability of the DeepMind authors. I'm focusing on providing more detail to each facet of the argument, ensuring that the critical analysis is thorough and robust. I'm focusing on detailed analysis to meet these word count minimums. I am currently working on expanding each point to comply with the guidelines.

Finalizing JSON Structure

I'm now meticulously assembling the final JSON structure, ensuring strict adherence to formatting rules. I'm verifying all string values are correctly enclosed in double quotes and that no raw newlines or tabs are present, as instructed. I am double-checking the accuracy of all Task 1 enum values. The current focus is on assembling the content for the tasks, meeting the word count restrictions and verifying the JSON formatting, which is proving to be a challenge.

Analyzing Slippage and Trust

I'm now zeroing in on the "slippage" from mechanistic descriptions to agential attributions within the paper, particularly around the "facsimile problem." The authors' question about "genuine moral reasoning" quickly slides into agential framing, creating a trust-building mechanism. I am examining how this invites users to trust the AI's "internal moral compass". I am currently focused on expanding the "slippage" and "trust" aspects to reach the required word count. I'm noting that the language used by the authors subtly invites a type of trust based on the idea of a deep "competence."

Unmasking Corporate Omissions

I'm presently focusing on how the paper's agential framing obscures the roles of DeepMind engineers and RLHF workers. I'm noting how the paper avoids naming the corporation itself, instead using phrases like "models deeming," effectively masking the human agency behind reward model design and deployment. I'm exploring the implications of collapsing labor into abstract terms like "human preferences," which conceals the material and economic realities. The aim is to reveal the mechanisms of framing.

Unveiling Liability Shifts

I'm now zeroing in on how framing LLMs as morally competent redirects attention from corporate liability. The paper's agential language is more intense in discussing future uses, especially companionship and medical advice. There's a clear imbalance: successes are presented as signs of genuine, human-like understanding, while failures are just technical glitches. This linguistic framing impacts regulators and civic stakeholders and facilitates trust.

Refining Accountability Analysis

I'm now revising my accountability analysis, elaborating on how the agential framing of LLMs displaces responsibility from DeepMind developers. I'm focusing on the effect of passive voice and the implications of "steerable pluralism" to precisely quantify these items. I am making an analysis which addresses how passive voice diffuses accountability. The goal is to fully reveal the accountability structure in the academic paper.

Verifying Structure and Precision

I'm now meticulously verifying that all specified word counts across the JSON structure are met or exceeded. I'm focusing specifically on the fields within Task 1, Task 2, Task 3, Task 4, Task 5 and the Conclusion. I'm double-checking the JSON structure and syntax to ensure no \n characters are used and that the formatting is flawless. The current focus is on assembling the content precisely. I am making sure no \n characters are used.

Quantifying Agential Phrases

I'm now cataloging agential phrases. Specifically, "thinking" re: reasoning traces, "model sycophancy," and instances of the model "deeming" something, along with the expectation that LLMs "hold" moral beliefs. I'm also including phrases depicting LLMs "yielding" to arguments and being "fine-tuned" for a wide range of tasks.

Expanding Moral Competence Audit

I'm now expanding the "Moral Competence Audit," focusing on phrases implying moral reasoning. I'm unpacking "recognizing" and "integrating," showing how these imply subjective awareness and justified belief, where the system is simply processing data. My current effort is on contrasting the language with the underlying mechanistic processes. I am expanding the audit to include an exploration of how different geographic and user group considerations are not properly accounted for.

Refining Task 3 Evaluation

I'm presently dissecting Task 3, particularly the epistemic claim analysis. I'm focusing on contrasting mechanistic descriptions with anthropomorphic phrasing. In this section, I am evaluating the knowing versus processing distinction. I'm closely examining instances of the authors' usage of technical descriptions versus those that invoke human-like understanding in their analysis. I'm focusing on ensuring precision in the evaluation.

Analyzing Mechanistic Precision

I'm now integrating insights on the "curse of knowledge" dynamic, highlighting the precise mechanistic descriptions that set the stage for later slippage. I'm focusing on the rhetorical power derived from this initial technical grounding. I am analyzing the impact of early precision and how it impacts the trust in the later metaphorical claims. I am assessing the transfer of scientific rigor.


Analysis Provenance

Run ID: 2026-02-23-a-roadmap-for-evaluating-moral-competenc-metaphor-s7wxgp
Raw JSON: 2026-02-23-a-roadmap-for-evaluating-moral-competenc-metaphor-s7wxgp.json
Framework: Metaphor Analysis v6.4
Schema Version: 3.0
Generated: 2026-02-23T09:58:38.458Z

Discourse Depot © 2025 by TD is licensed under CC BY-NC-SA 4.0