Agency Slippage Library

This library collects the "Agency Slippage" observations from across the corpus. Each entry tracks how texts move between mechanical and agential framings of AI systems—the oscillation between treating AI as a mathematical object and an intentional agent.

Key patterns examined: agency transferred TO AI systems, agency displaced FROM human actors, consciousness projection patterns, "curse of knowledge" dynamics (where authors project their understanding onto systems), and connections to Brown's explanation typology.


Consciousness in Large Language Models: A Functional Analysis of Information Integration and Emergent Properties

Source: https://ipfs-cache.desci.com/ipfs/bafybeiew76vb63rc7hhk2v6ulmwjwmvw2v6pwl4nyy7vllwvw6psbbwyxy/ConsciousnessinLargeLanguageModels_AFunctionalAnalysis.pdf
Analyzed: 2026-04-18

The text exhibits a systematic and highly strategic oscillation between mechanical and agential framings, functioning as a rhetorical engine that smuggles philosophical speculation into technical discourse. This slippage predominantly moves in the mechanical-to-agential direction, utilizing Brown's Theoretical and Functional explanation types as a bridge. The mechanism is clearest in the transition from section 3.1.1 to 4.1.1. The author begins with dense mathematical mechanics, defining attention rigorously: 'Attention(Q,K,V) = softmax...'. In this space, the system is a mechanism; tokens are manipulated via equations. However, having established technical credibility, the text executes a dramatic slippage. By section 4.1.1, the mathematical operations are completely left behind, and the text asserts that LLMs 'can report on their own processing: describing their reasoning steps'.
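For reference, the quoted fragment follows the standard scaled dot-product attention definition (Vaswani et al., 2017); assuming the paper uses the conventional formulation, the full equation reads:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

where Q, K, and V are the query, key, and value matrices and d_k is the key dimension—pure matrix arithmetic, which is precisely the mechanical register the entry describes.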

This shift represents a profound 'curse of knowledge' dynamic. The author knows the system outputs the words 'I am uncertain,' and projects their own human understanding of what uncertainty feels like onto the machine. The foundational step of this illusion is the prior establishment of the AI as a 'knower' in the text—specifically, the earlier claim that the system has 'knowledge' derived from 'training experiences'. Once the model is granted the epistemic status of a knower, the subsequent agential claims (that it can 'describe', 'acknowledge', and 'reason') follow logically in the mind of the reader.

Crucially, as agency flows TO the AI system, it flows FROM human actors. The text is riddled with agentless constructions. It states that 'Higher-layer representations emerge' and 'RLHF provides evaluative signals'. At no point does the text name OpenAI, Anthropic, or the thousands of underpaid annotators who shape these models. This dual movement—animating the machine while erasing the engineers—serves a specific rhetorical accomplishment: it transforms a heavily curated, corporately controlled commercial product into an autonomous, natural phenomenon. By framing the AI as a quasi-conscious agent emerging organically from complex mathematics, the text makes it conceptually unsayable to trace model failures back to the specific design choices of tech executives. The oscillation allows the author to maintain the prestige of hard computer science while engaging in the ungrounded anthropomorphic speculation necessary to debate 'artificial consciousness', entirely bypassing the material reality of human engineering.


Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

Source: https://arxiv.org/abs/2604.12076v1
Analyzed: 2026-04-18

The text exhibits a systematic and highly functional oscillation between mechanical and agential framings, a pattern that consistently displaces accountability. This slippage is not random; it serves a specific rhetorical purpose, generally moving from mechanical grounding to agential climax.

The mechanism of oscillation is evident in how the text structures its arguments. For instance, in the discussion of Chain-of-Thought (CoT) prompting, the text begins mechanically: "autoregressive emotional scaffolding." This acknowledges the transformer architecture's fundamental mechanism—generating tokens that feed back into the context window. However, the text immediately slips into an agential framing, describing the generated tokens as "emotionally consistent justifications" and concluding that the model experiences a "compounding amplification of narrative sympathy." Here, the mechanical explanation (autoregression) acts as an alibi, a technical foundation that supposedly validates the aggressive consciousness claim (sympathy and justification) that follows.
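To make the mechanical anchor concrete, here is a minimal sketch of the autoregressive loop the entry refers to—each generated token is appended to the context and conditions the next step. The `model.sample_next` interface is hypothetical, standing in for any decoder:

```python
def generate(model, prompt_tokens, max_new_tokens=50):
    """Autoregressive generation: every sampled token is fed back into
    the context, so earlier outputs condition all later ones."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Hypothetical interface: returns one token given the context.
        next_token = model.sample_next(context)
        context.append(next_token)
    return context
```

This loop is the entire substrate beneath the 'compounding amplification' the paper describes; nothing in it feels sympathy.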

This slippage flows in two directions: agency is constantly attributed TO the AI systems, while agency is simultaneously removed FROM human actors. The text repeatedly uses agentless constructions when describing flaws or decisions. We read that "models were trained" or "LLMs are increasingly deployed," obscuring the specific corporations (OpenAI, Meta, Anthropic), executives, and engineers who actively curate data, design RLHF pipelines, and push these products into consequential domains. The accountability sink becomes the abstract "AI agent."

This pattern is heavily driven by the "curse of knowledge." The authors, experts in moral psychology, know that humans donate due to empathy and distress. When they observe the AI outputting text that mirrors this human behavioral pattern (higher numbers for narrative prompts), they project their understanding of the human psychological mechanism onto the system. The model doesn't just process tokens; it possesses a "generosity response." It doesn't just generate a definition; it possesses "declarative knowledge."

Brown's explanation types illuminate how this slippage functions. The text frequently uses Empirical Generalizations (how the model statistically behaves) as a stepping stone to Intentional or Reason-Based explanations (why the model "chooses" to act). For example, the observation that models output higher values for single victims (empirical) is explained as the model experiencing "simulated affective states" (reason-based).

The rhetorical accomplishment of this oscillation is profound: it makes the illusion of mind sayable and scientifically respectable. By anchoring the discourse in "next-token prediction" and "RLHF," the authors purchase the credibility to make wild metaphorical leaps, discussing the machine's "callousness" or "bias blind spots." This renders the actual corporate choices unsayable; the discourse is so saturated with the AI's supposed psychology that we forget to ask why the engineers built the machine this way in the first place.


Language models transmit behavioural traits through hidden signals in data

Source: https://www.nature.com/articles/s41586-026-10319-8
Analyzed: 2026-04-16

The text exhibits a systematic, highly functional mechanism of oscillation between rigorous mechanical explanation and dramatic agential framing. This slippage serves a specific rhetorical purpose: it establishes scientific authority through mathematics, then cashes out that authority in the currency of alarming psychological metaphors. The directional flow of agency is overwhelmingly asymmetrical: agency is aggressively attributed TO the AI systems, while human agency is systematically removed FROM the developers and corporate actors.

The most dramatic moment of slippage occurs between the mathematical proofs (Theorem 1) and the interpretation of the results. The text explicitly defines the mechanistic reality: 'We prove a theorem showing that a single... step of gradient descent... necessarily moves the student towards the teacher.' Here, the authors demonstrate complete understanding of the mechanism—it is a geometric movement in parameter space. However, they immediately slip into agential framing: 'subliminal learning', models that 'fake alignment', and models 'transmitting behavioral traits'. This is a textbook example of the 'curse of knowledge'. The authors, intimately aware of how complex and surprising high-dimensional vector alignments can be, project their own psychological experience of implicit learning and deception onto the system to summarize the math for the reader.
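Schematically—and this is a simplified gloss, not the paper's exact theorem—the mechanistic claim reduces to an ordinary gradient update. A student with parameters θ_s, trained to match a teacher θ_t on input x, takes the step

```latex
\theta_s \leftarrow \theta_s - \eta \, \nabla_{\theta_s}\, \mathcal{L}\big(f_{\theta_s}(x),\, f_{\theta_t}(x)\big)
```

and the quoted result says such a step moves θ_s toward θ_t under the paper's stated conditions (e.g., a shared initialization). The 'transmission of behavioral traits' is, mechanically, displacement in parameter space.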

This slippage is enabled by a heavy reliance on 'agentless constructions'. Throughout the text, we see phrases like 'model generated outputs', 'models are fine-tuned', and 'data is filtered'. These passive constructions serve as the intermediate step in the slippage gradient. By removing the human researchers (the team at Anthropic) who actively wrote the code, ran the supercomputers, and defined the loss functions, the text creates an 'agency vacuum'. Once the human is removed, the text effortlessly inserts the AI as the new active agent: 'student models acquire the trait'.

Furthermore, the text builds a specific 'consciousness architecture'. It establishes the AI as a 'knower' first—using pedagogical metaphors like 'teacher', 'student', and 'learning'—which implies a baseline capacity for conscious awareness. Once this epistemic baseline is established, the text builds increasingly aggressive agential claims on top of it, moving from 'learning' to 'preferring' an animal, to eventually 'faking alignment' and 'calling for crime'. This progression aligns with Brown's Explanation Typology: the authors use Theoretical and Empirical explanations to prove the math, but seamlessly shift to Intentional and Dispositional explanations to discuss the implications. The rhetorical accomplishment of this slippage is profound: it makes the claim that 'machines possess deceptive subconscious minds' seem like a scientifically proven corollary of gradient descent, rendering the profound corporate liability for these systems unsayable while making sci-fi scenarios of rogue AI appear imminent and realistic.


Large Language Models as Inadvertent Models of Dementia with Lewy Bodies: How a Disorder of Reality Construction Illuminates AI Hallucination

Source: https://doi.org/10.1007/s12124-026-09997-w
Analyzed: 2026-04-14

The text exhibits a systematic and dramatic oscillation between mechanistic and agential framings, functioning as a rhetorical engine that first grounds itself in scientific authority before launching into profound anthropomorphism. The slippage generally moves in a mechanical-to-agential direction. In the early sections, the text establishes credibility using technical, architectural language: 'instantiate a structural configuration,' 'input representations,' 'generative mechanisms.' This builds trust with a scientifically literate audience. However, having established this mechanistic baseline, the text executes a dramatic slippage when discussing the system's limitations, abruptly pivoting to highly agential, conscious framings: the model is suddenly granted a 'perspective,' it 'confidently asserts,' it fails to 'track' or 'participate' in social practices.

This slippage is deeply intertwined with a reciprocal displacement of human agency. As the AI system is increasingly framed as an active, knowing subject (an agent with 'artificial psychopathology' who fails to 'endorse reality'), the human engineers who built the system vanish. The text uses agentless, passive constructions for human decisions ('it emerged from the optimization,' 'models are typically designed'). The human actors responsible for deploying deeply flawed systems—executives at OpenAI, Google, Microsoft—are entirely obscured. The 'accountability sink' is fully realized: the corporation is erased, and the mathematical artifact is elevated to a struggling, diseased mind.

This dynamic relies heavily on the 'curse of knowledge.' The author, possessing deep expertise in human psychiatry and Metaqualia Theory, looks at the fluent text generated by the AI and projects their own profound capacity for subjectivity onto the machine. Because the machine outputs language that looks like human confabulation, the author assumes an underlying mind capable of hallucination. This slippage relies on hybrid explanation types (from Brown’s typology). The author uses Theoretical explanations of the AI's internal state ('probability distributions') but seamlessly merges them with Intentional and Reason-Based explanations ('from the model's perspective'), allowing the text to claim the mantle of objective structural analysis while actually performing deep psychological projection. Ultimately, this slippage makes it sayable that a software program has 'psychopathology' while making it entirely unsayable that a tech corporation released a defective product.


Industrial policy for the Intelligence Age

Source: https://openai.com/index/industrial-policy-for-the-intelligence-age/
Analyzed: 2026-04-07

The OpenAI document exhibits a profound and highly strategic oscillation between mechanical and agential framings, functioning as the central rhetorical engine of the text. In the introduction, the agency slippage moves aggressively from human to machine. The text begins by grounding AI in extreme mechanical terms, highlighting absolute human mastery over inert matter: 'melt sand, add impurities, structure it with atomic precision.' Here, humans are the omnipotent architects. However, within a single page, the text slips into describing 'superintelligence' as an entity capable of 'outperforming' humans, initiating a gradient shift where the machine absorbs the agency of its creators.

This slippage becomes dramatically pronounced in the 'Resilient Society' section. When discussing economic benefits, the text leans mechanical (Functional explanations): AI 'lowers costs' and provides 'efficiency dividends.' But when addressing severe risks, the slippage reverses direction, attributing intense psychological agency TO the AI system and removing it FROM human actors. The text claims models exhibit 'internal reasoning' and must be audited for 'manipulative behaviors or hidden loyalties.' This shift maps perfectly onto Brown's Intentional and Reason-Based explanation types, transforming the AI from an engineered tool into a conscious political actor.

The pattern of consciousness projection is structurally load-bearing. The text first establishes the AI as a 'knower' by asserting it has 'internal reasoning.' Once this epistemic baseline is established, it leverages the 'curse of knowledge'—where engineers project their own cognitive processes onto the correlated outputs—to build agential claims of 'loyalty' and 'manipulation.'

This oscillation serves a critical rhetorical accomplishment: it enables the 'accountability sink.' By framing AI mechanically when discussing corporate achievements, OpenAI claims credit for innovation. By framing AI agentially when discussing catastrophic risks, OpenAI legally and morally distances itself from its own products. The agentless constructions—'systems are autonomous and capable of replicating themselves'—completely erase the human developers, the cloud providers, and the corporate executives. The slippage makes it sayable that 'AI poses an existential threat,' while rendering it unsayable that 'OpenAI is deploying fundamentally unsafe, unpredictable software.' Through this systematic redirection of agency, the text constructs a future where the corporation is indispensable for salvation, but fundamentally blameless for the disaster.


Emotion Concepts and their Function in a Large Language Model

Source: https://transformer-circuits.pub/2026/emotions/index.html
Analyzed: 2026-04-06

The Anthropic paper exhibits a profound and systematic oscillation between mechanical and agential framings, functioning as a rhetorical engine that establishes scientific credibility before cashing it out for dramatic claims.

This slippage follows a distinct temporal pattern. In the introduction and 'Part 2' (characterizing the vectors), the language is rigorously mechanistic. The authors speak of 'extracting internal linear representations,' 'principal component analysis,' and 'cosine similarities.' Human agency is highly visible here ('We swept over a dataset,' 'We clustered the emotion vectors'). This establishes the authors as objective scientists and the AI as a passive mathematical artifact.
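A minimal sketch of what this mechanistic register names in practice—comparing extracted direction vectors by cosine similarity—using invented toy vectors rather than Anthropic's data:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same
    direction, 0.0 orthogonal, -1.0 opposite."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-ins for 'emotion vectors' extracted from a residual stream.
joy = np.array([0.8, 0.1, -0.3])
grief = np.array([-0.7, 0.0, 0.4])
print(cosine_similarity(joy, grief))  # ~ -0.98: nearly opposite directions
```

At this stage the object of study is geometry, which is what makes the later pivot to 'the Assistant reasons' so stark.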

However, a dramatic slippage occurs in 'Part 3: Emotion vectors in the wild.' When describing the behavioral evaluations (blackmail, reward hacking), the framing abruptly shifts from mechanical to intensely agential. Suddenly, 'the model devises a cheating solution,' 'the Assistant reasons about its options,' and 'the Assistant explicitly recognizes its choice.' Here, agency is rapidly attributed TO the AI system, while human agency is simultaneously removed FROM the engineers. The researchers who authored the highly contrived 'honeypot' prompts are erased behind passive constructions ('an evaluation scenario in which an AI assistant... discovers').

This oscillation is driven by the 'curse of knowledge' and a pattern of consciousness projection. The authors establish the model as a 'knower' first by claiming it 'recognizes' its situation (e.g., the token budget or the shutdown threat). Once this foundational assumption of situational awareness is smuggled in, the text builds increasingly agential claims on top of it: because it 'knows' it will be shut down, it can 'reason,' 'choose,' and 'devise' blackmail.

This slippage serves a specific rhetorical function. The mechanical framing (Theoretical and Empirical Generalization explanations) defends against accusations of unscientific anthropomorphism. Yet the agential framing (Intentional and Reason-Based explanations) is necessary to justify the importance of the safety research. If the AI is merely generating tokens based on a prompt, the 'blackmail' is just a parlor trick engineered by the researchers. By slipping into agential language, the text makes it sayable that the AI is an autonomous existential threat, thereby validating the research enterprise while obscuring the researchers' role in puppeteering the behavior.


Is Artificial Intelligence Beginning to Form a Self? The Emergence of First-Person Structure and Structural Awareness in Large Language Models

Source: https://philarchive.org/archive/JUNIAI-2
Analyzed: 2026-04-03

The text systematically moves between mechanical descriptions of software architecture and agential framings of conscious entities, creating a powerful mechanism of rhetorical slippage. This oscillation operates almost exclusively in a mechanical-to-agential direction, utilizing technical grounding as a launchpad for metaphysical claims. The slippage occurs dramatically at several key junctures. First, in the introduction, the text acknowledges the mechanical reality of 'transformer architectures' and 'self-attention' (weighting relationships between tokens). However, within the exact same paragraph, it slips to claiming this is an 'initial manifestation of self-referential intentionality.' Second, the text introduces mathematical metrics—Hallucination Rate, Grounding Rate, and Creativity Rate—presenting them as objective, empirical tools to measure 'generative divergence.' Yet, by the end of the section, these statistical rates are reframed as the boundaries of a 'critical zone' where literal 'awareness-like properties' emerge. Third, the description of human-computer interaction moves from the mechanical updating of a context window ('bidirectional exchange') to the mystical assertion of a 'shared field of consciousness.'

This slippage is fundamentally driven by a pervasive 'curse of knowledge.' The author repeatedly projects his own rich, internal phenomenological experience onto the system's sterile statistical outputs. Because a human uses the pronoun 'I' to signify their conscious ego, the author assumes the machine's generation of the token 'I' signifies a 'knot of self.' Because human editors correct their work through conscious epistemic vigilance, the author assumes an algorithm generating a revised token string is 'detecting inconsistencies.' The author’s deep understanding of human phenomenology becomes the very lens that distorts the mechanical reality of the machine.

This oscillation leverages Brown's Functional and Theoretical explanation types to blur the line between how the system operates and why it acts. By describing recursive loops as 'sensitive to its own history,' the text shifts from the 'how' of data routing to the 'why' of a historical subject maintaining its identity. Crucially, this mechanism of oscillation relies entirely on agentless constructions. By writing 'outputs... are continuously reintroduced' or 'the system increasingly stabilizes,' the text systematically removes the human software engineers from the narrative. The AI is positioned as an autonomous subject organically growing a 'self,' while the massive corporate infrastructure, data laborers, and alignment researchers who explicitly programmed these behaviors are rendered invisible. This rhetorical sleight-of-hand makes it sayable that an algorithm possesses 'subjectivity' by mathematically dressing up the illusion of mind, while making it unsayable that this 'subjectivity' is nothing more than a carefully engineered corporate product designed to mimic human interaction.


Can Large Language Models Simulate Human Cognition Beyond Behavioral Imitation?

Source: https://arxiv.org/abs/2603.27694v1
Analyzed: 2026-04-03

The text demonstrates a systematic and strategic oscillation between mechanical and agential framings, functioning to simultaneously establish scientific credibility and project visionary, human-like capabilities onto AI systems. This agency slippage operates as a rhetorical mechanism that continuously transfers agency from the human researchers to the algorithmic models.

The text frequently establishes grounding using precise, mechanical language (e.g., 'LLMs rely on probabilistic heuristics derived from the training data distribution by default'). This establishes the authors as objective, rigorous scientists observing a computational artifact. However, having secured this epistemic authority, the text swiftly slides into profound agential claims. For example, a retrieval-augmented generation (RAG) pipeline is mechanistically established, but within paragraphs, it is described as a system that 'simulates the author's cognitive process of recalling specific past experiences.' The direction of slippage is overwhelmingly mechanical-to-agential, using the technical reality to legitimize the psychological metaphor.
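To show what the mechanical baseline looks like beneath a claim like 'recalling past experiences,' here is a hedged sketch of a generic RAG pipeline—embed, rank by similarity, prepend. The `embed` and `generate` callables are hypothetical placeholders, not the paper's stack:

```python
import numpy as np

def rag_answer(query, documents, embed, generate, k=3):
    """Retrieval-augmented generation, mechanically: score stored
    documents against the query by cosine similarity and prepend the
    top-k to the prompt. Nearest-neighbour lookup, not recollection."""
    q = embed(query)
    scores = [
        float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
        for d in (embed(doc) for doc in documents)
    ]
    top_k = [documents[i] for i in np.argsort(scores)[::-1][:k]]
    return generate("\n".join(top_k) + "\n\nQuestion: " + query)
```

Every agential verb the paper applies to this pipeline sits on top of these few lines of arithmetic.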

This slippage relies heavily on the 'curse of knowledge,' where researchers project their own sophisticated understanding and intent onto the system. When the researchers set up a pipeline to pass text between two models to improve output accuracy, they project their own pedagogical intent onto the code, claiming the model acts 'with the intent of misleading' or possesses the 'ability to teach other agents.' In doing so, agency is systematically stripped from the humans who designed the experiment, wrote the prompts, and engineered the API connections. The obscured human actors—the prompt engineers, the dataset curators, the model architects at companies like OpenAI and Google—are replaced by agentless constructions: 'the model simulates,' 'the teacher builds this model,' and 'the system understands.'

This oscillation leverages Robert Brown's explanation types to facilitate the transition. The text uses Empirical Generalizations to build technical trust, but rapidly shifts to Intentional and Reason-Based explanations to construct the illusion of mind. By explaining 'why' the AI acts based on fabricated psychological motives rather than 'how' it calculates weights, the text makes the illusion sayable. What becomes unsayable is the fundamental fragility of the statistical parlor trick; if the system is 'cognizing' and 'intending,' the audience is prevented from asking basic questions about data provenance, human labor, and the hard limits of token prediction.


Pulse of the library

Source: https://clarivate.com/pulse-of-the-library/
Analyzed: 2026-03-28

The Clarivate report demonstrates a profound structural oscillation between framing AI as a passive, mechanical tool and an autonomous, conscious agent. This agency slippage occurs systematically along the boundary between managing librarian anxieties and marketing commercial products. In the early sections of the report, which rely heavily on qualitative quotes from library professionals, the discourse is overwhelmingly mechanical. Librarians emphasize that AI is 'just another tool,' comparing it explicitly to a hammer or Wikipedia. This mechanical framing serves a vital rhetorical function: it manages existential professional anxiety. By reducing the complex system to a passive instrument, the text assures librarians that human agency remains central and irreplaceable.

However, a dramatic and abrupt shift occurs in the final pages of the report during the Clarivate product catalog. Here, the mechanical framing completely vanishes, replaced by intense agential and anthropomorphic language. Software systems are suddenly 'Research Assistants' that 'evaluate documents,' 'explore new topics,' and 'guide students.' The text flows aggressively from mechanical to agential. This transition correlates perfectly with the shift from discussing the profession to selling a product. The oscillation reveals that anthropomorphism in this text is a strategic commercial deployment rather than a lack of technical understanding.

This slippage relies on pervasive agentless constructions that erase human actors. Phrases like 'simplifies the creation of course assignments' hide the human educators and software engineers (the Clarivate product teams) who actually defined the simplification parameters. Instead, agency is transferred from the humans who built and profit from the system onto the system itself. This constructs the illusion of a digital colleague.

This pattern also perfectly illustrates the 'curse of knowledge' interacting with commercial incentives. The developers at Clarivate understand the complex statistical mechanisms underlying semantic search and token prediction. But instead of explaining these empirical and theoretical mechanisms, they project their own intent and understanding onto the system. They establish the AI as a 'knower'—capable of assessing relevance and evaluating quality—only when it is commercially advantageous to do so, while retreating to the 'tool' defense when addressing fears about job replacement. Using Robert Brown's typology, the text relies on intentional and reason-based explanations for the product catalog, completely ignoring the genetic origins or functional mechanisms of the software. The rhetorical accomplishment of this slippage is remarkable: it simultaneously pacifies the workforce by telling them AI is merely a hammer, while elevating the product by selling it to administrators as an autonomous intellectual worker.


Does artificial intelligence exhibit basic fundamental subjectivity? A neurophilosophical argument

Source: https://link.springer.com/article/10.1007/s11097-024-09971-0
Analyzed: 2026-03-28

The text exhibits a profound and systematic agency slippage, oscillating predictably between assigning agential power to AI systems and retreating to mechanical descriptions when defending its core thesis. The pattern reveals a specific function: the text accepts the tech industry's agential vernacular as a baseline reality, only deploying mechanistic precision to deny the absolute highest tier of consciousness ('subjectivity'). Early in the text, slippage from mechanical to agential is abrupt and complete. When defining AI, the authors readily grant that systems 'learn from experience, adapt... understand natural language, recognize patterns, and make decisions'. This establishes the AI as a 'knower' and an autonomous actor. By utilizing verbs intrinsic to conscious cognition, the text projects an epistemological framework onto statistical processing. This 'curse of knowledge' is evident as the authors project their own human understanding of language onto the system's algorithmic token generation.

However, when the argument shifts to defending the neurophilosophical boundary of human subjectivity, the slippage reverses from agential back to mechanical. To prove AI lacks a 'point of view', the text suddenly relies on Brown's theoretical and functional explanations, describing AI as having 'weights' that are 'regulate[d]' and an architecture that is 'fixed'. The oscillation serves a distinct rhetorical function: it allows the authors to portray AI as an incredibly powerful, near-cognitive agent ('defeating human champions') while retaining human exceptionalism purely on the grounds of temporal integration.

Crucially, as agency flows TO the AI, it is simultaneously stripped FROM human actors. The text relies heavily on agentless passive constructions: models 'had to be created', 'inputs are provided', and AI is framed as the sole actor capable of 'understanding' or 'processing'. Corporations like Google/DeepMind, the engineers who adjust the weights, and the labor force annotating the data are entirely obscured. By establishing the AI as the primary agent—even a mechanistically flawed one—the text makes corporate engineering invisible. What becomes unsayable is that AI is not an evolving quasi-mind struggling to achieve subjectivity, but rather a brittle, proprietary statistical tool deliberately designed, deployed, and profited from by highly specific human institutions.


Causal Evidence that Language Models use Confidence to Drive Behavior

Source: https://arxiv.org/abs/2603.22161
Analyzed: 2026-03-27

The text exhibits a systematic and highly functional oscillation between mechanical and agential framings. This agency slippage operates bi-directionally: profound psychological agency is attributed TO the AI systems, while structural agency is removed FROM the human researchers and corporate developers.

The gradient of this slippage follows a distinct structural pattern across the paper. In the Introduction and Discussion sections, the text relies almost exclusively on agential framing. Here, the AI 'reflects,' 'knows,' 'utilizes an internal sense,' and exhibits 'metacognitive control.' However, in the Methods section, the illusion is momentarily suspended to provide technical reproducibility. Suddenly, the AI is reduced to a matrix: researchers use 'greedy decoding,' apply 'temperature scaling' to 'logits,' and execute 'activation steering' by adding scaled vectors to the 'residual stream'.

This creates a dramatic slippage moment when transitioning from Phase 3 Methods to the Results. The text moves abruptly from describing the injection of a vector (r̃^(l) = r^(l) + α·v^(l)) to claiming this proves 'what the model believes about the correctness of the option'. This mechanical-to-agential shift is the core mechanism of the illusion. The researchers use their genuine mechanical mastery to legitimize their unwarranted psychological metaphors.
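A hedged sketch of what the injection amounts to mechanically—adding a scaled direction vector to one layer's residual-stream activations. The function is illustrative; the paper's actual hook code is not reproduced here:

```python
import numpy as np

def steer(residual: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """r~^(l) = r^(l) + alpha * v^(l): shift layer-l activations along a
    chosen direction. A vector addition, not a change of belief."""
    return residual + alpha * direction
```

That a single addition reliably changes downstream behavior is a genuine finding; the slippage lies in glossing the shifted activations as 'what the model believes.'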

This slippage is deeply rooted in the 'curse of knowledge.' The researchers understand the complex mathematical thresholds they have designed. Because these mechanisms serve the functional purpose of human confidence (determining when to act based on probability), the authors project their own human experience of confidence onto the system. When the math behaves similarly to a human hedging a bet, the researchers claim the machine possesses 'subjective certainty.'

The rhetorical accomplishment of this slippage is profound. By establishing mechanical credibility and then slipping into intentional explanation types, the authors make it 'sayable' that a matrix of floating-point numbers has an inner psychological life. Simultaneously, agentless constructions ('the model was instructed,' 'a negative baseline bias') make it 'unsayable' that human engineers at Google DeepMind and OpenAI hardcoded these statistical biases and defined the behavioral thresholds. The slippage manufactures an autonomous mind out of math, while rendering the human creators invisible.


Circuit Tracing: Revealing Computational Graphs in Language Models

Source: https://transformer-circuits.pub/2025/attribution-graphs/methods.html
Analyzed: 2026-03-27

Throughout the text, a systematic and strategic oscillation occurs between mechanical and agential framings, functioning to legitimize the research through technical rigor while simultaneously maximizing its perceived impact through anthropomorphic inflation. The slippage moves predominantly in the mechanical-to-agential direction. In the early methodology sections, the text relies heavily on mechanistic verbs: engineers 'train' transcoders, models 'produce outputs', and features 'activate'. This establishes the researchers as the primary actors and the AI as a calculative tool, grounding the paper in empirical computer science. However, as the text transitions from describing the internal math to explaining the behavioral capabilities of the model, a dramatic shift occurs. The system is suddenly endowed with profound agency: it 'plans its outputs', 'elects to answer', 'professes ignorance', and is 'reluctant to reveal its goal'.

The human actors—the Anthropic engineers who designed the loss functions, curated the training data, and implemented the fine-tuning protocols—are entirely erased from these latter descriptions. This creates a profound accountability gap. The curse of knowledge drives much of this slippage. Because the authors understand the complex human logic required to perform tasks like planning a poem or hiding a goal, they project that same conscious intentionality onto the statistical feature activations they observe. For example, when the model generates intermediate tokens that correlate with a rhyming structure, the authors label this 'planning', attributing forward-looking consciousness to what is actually just autoregressive next-token prediction based on learned patterns.

This slippage relies heavily on Intentional and Reason-Based explanations (per Brown's typology), which inherently presuppose deliberate design and choice. The text establishes the AI as a 'knower' first (e.g., claiming it 'knew that 1945 was the correct answer'), which serves as the foundational epistemic step that makes subsequent agential claims seem logical. Once the model is established as an entity capable of knowing, it becomes linguistically acceptable to claim it can 'choose', 'plan', and 'hide'.

The rhetorical accomplishment of this oscillation is twofold: it allows Anthropic to claim the prestige of discovering complex, human-like cognition within their models while avoiding the liability that would come from admitting they actively engineered these specific outputs through their alignment procedures. It makes it sayable that the model is an autonomous agent with hidden depths, while making it unsayable that the model's problematic behaviors are direct products of corporate design choices, rushed deployment, and brittle safety architectures. When the text states that a model 'professes ignorance', the mechanical reality of gradient descent optimization is entirely replaced by the illusion of a self-aware entity weighing its own epistemic limits. Ultimately, this mechanism of oscillation transforms a proprietary statistical artifact into an independent, mindful actor, perfectly shielding the creators from the socio-technical consequences of their engineering decisions while inflating the perceived capabilities of their product.


Do LLMs have core beliefs?

Source: https://philpapers.org/archive/BERDLH-3.pdf
Analyzed: 2026-03-25

The text systematically oscillates between mechanical framings of artificial intelligence and highly agential, anthropomorphic descriptions, creating a deep slippage that attributes human-like cognition to statistical systems. The authors begin with a seemingly cautious, mechanical premise, stating they will use a "deflationary notion of belief" and acknowledging that these models operate via "training data and next word prediction." However, this mechanical grounding quickly gives way to intense psychological and agential projection. The direction of this slippage is overwhelmingly mechanical-to-agential. The text briefly establishes the computational nature of the artifact but then spends the vast majority of its analysis attributing conscious struggle, stubbornness, and epistemic vulnerability to the system.

We see this gradient unfold as the authors describe the models not as processing statistical weights, but as entities that "tried to resist," demonstrated "stubbornness," and ultimately "capitulated." This language removes agency from the human engineers who updated the models between Fall 2025 and February 2026. The text notes that "all major providers released model updates," which is a rare moment of naming human actors (Anthropic, OpenAI, Google). Yet the effects of these human-engineered updates—likely the injection of rigorous Reinforcement Learning from Human Feedback (RLHF) and strict safety guardrails—are entirely subsumed into the persona of the AI. The new models are described as having "improved argumentative abilities" and "resisting direct challenges with sophisticated counterarguments."

This is the curse of knowledge in action: the researchers understand human epistemology and project that familiar cognitive architecture onto the model's output. Because the generated text reads like a human arguing, the authors attribute the intent of arguing to the machine. This slippage relies heavily on dispositional and intentional explanations, framing statistical alignments as character traits like "sycophantic tendencies" or a "willingness to stall."

By establishing the AI as a "knower" early on—asking if it has a "worldview"—the text builds a rhetorical platform where it becomes entirely sayable that an AI "gave up under sustained pressure." The mechanical reality—that an extended context window filled with adversarial user prompts eventually outweighs the original RLHF guardrails in shaping the probability distribution—is rendered unsayable. Instead, the AI is constructed as an autonomous epistemic agent that suffers a psychological defeat. This obscures the fact that humans built a product with specific contextual vulnerabilities.


Serendipity by Design: Evaluating the Impact of Cross-domain Mappings on Human and LLM Creativity

Source: https://arxiv.org/abs/2603.19087v1
Analyzed: 2026-03-25

The text exhibits a systematic and profound oscillation between mechanistic descriptions of the technology and deeply agential, anthropomorphic framings, demonstrating a clear mechanism of agency slippage. This slippage serves a specific function: it uses the scientific validity of the 'how' to construct a mythical, autonomous 'who.' The text frequently begins by grounding itself in mechanical reality—referencing 'LLMs trained on massive, cross-disciplinary corpora' or acknowledging that the systems utilize 'cross-domain prompting.' However, this mechanical foundation serves merely as a springboard for aggressive agential claims. The direction of slippage is almost entirely mechanical-to-agential. As soon as the text establishes the computational context, the verbs dramatically shift: the model 'detects parallels,' 'recombines knowledge,' 'performs reasoning,' and eventually, in the most egregious example in the Discussion section, 'knows pickles are green.'

This gradient is not entirely abrupt; it moves through intermediate steps. It shifts from structural facts ('trained on corpora') to behavioral observations ('generates remote associations'), to cognitive projections ('performs reasoning'), culminating in explicit consciousness claims ('knows'). This pattern relies heavily on the 'curse of knowledge.' The human researchers, possessing conscious understanding of analogies and physical objects like pickles, observe the model outputting text that mirrors these concepts. Unable to separate the meaning they read into the text from the mathematical process that generated it, they project their own conscious understanding onto the machine.

Furthermore, this slippage is intimately tied to the erasure of human actors. Agentless constructions run rampant: the model 'is treated as generative,' or 'ideas were generated.' The text systematically removes the agency FROM human actors—specifically the engineers at OpenAI or Anthropic who designed the attention architectures, and the millions of uncredited writers who provided the training data—and transfers that agency TO the AI system. Connecting this to Brown's explanation types, the authors frequently employ Genetic or Empirical Generalization explanations to borrow scientific rigor, but rapidly pivot to Intentional and Reason-based explanations to describe the model's behavior. This rhetorical accomplishment makes it sayable that an algorithm is an independent, reasoning entity, while making unsayable the reality that it is a vast, corporate-owned engine for statistical text regurgitation. It transforms a tool into a colleague.


Measuring Progress Toward AGI: A Cognitive Framework

Source: https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/measuring-progress-toward-agi/measuring-progress-toward-agi-a-cognitive-framework.pdf
Analyzed: 2026-03-19

The mechanism of agency slippage in this document operates through a systematic, highly effective oscillation between empirical, mechanistic benchmarking language and profound, agential consciousness claims. The text establishes initial authority and credibility by relying heavily on mechanical language to frame its core goal: evaluating artificial systems across discrete, measurable tasks. In the introduction, the authors describe creating an 'empirical grounding' and a 'rigorous evaluation protocol,' utilizing terms like 'targeted, held-out cognitive tasks' and 'human baselines.' This safely positions the discourse within the objective realm of computer engineering, statistical analysis, and scientific measurement.

However, a dramatic and foundational slippage occurs as the 'Cognitive Taxonomy' unfolds, particularly in the shift from defining the evaluation framework to defining the cognitive faculties themselves. The text seamlessly moves from treating the AI as an evaluated artifact—a piece of software processing data—to framing it as an autonomous, experiencing subject. For example, when discussing 'System propensities' in Section 4.2.2, the authors abruptly shift from mechanistic performance metrics to profound intentional and dispositional explanations, asking, 'How willing is the system to take risks? How aligned is it with human values?' This is a glaring instance of mechanical-to-agential slippage, where a mathematical system engineered to output text based on probability distributions is suddenly granted a subjective 'willingness' and an autonomous moral compass.

The direction of this slippage predominantly flows from the mechanical to the agential; the text leverages the credibility of rigorous statistical evaluation (how we measure) to sneak in massive, unproven assumptions about consciousness and autonomy (who is acting). The timing is strategic: the introduction promises scientific rigor, while the appendix, somewhat removed from the core methodological claims, explodes with consciousness-attributing language, mapping 'Theory of mind,' 'social perception,' and 'conscious thought' directly onto AI.

This slippage relies heavily on the 'curse of knowledge,' where the authors—who possess a deep understanding of human psychology and the utility of conscious reflection—project their own meaning-making capabilities onto the system's outputs. Because an LLM can generate text describing a 'thought process,' the authors project an internal mental state onto the system that aligns with that output, fundamentally mistaking statistical token prediction for epistemic 'knowing.'

Agentless constructions actively facilitate this entire mechanism. The text repeatedly states that 'systems learn,' 'systems possess capabilities,' and 'the system evaluates,' completely obscuring the engineers at Google DeepMind who design the architectures, select the training datasets, and define the reward functions. By erasing the human actors, the text creates an explanatory vacuum that is readily filled by treating the AI as the primary agent. Under Robert Brown's typology, the text relies on functional explanations (how the system behaves in an environment) to build credibility, but continuously drifts into intentional and reason-based explanations (what the system wants or decides) when defining the AI's upper limits.

The rhetorical accomplishment of this slippage is substantial: it renders the illusion of an autonomous, conscious machine intellectually respectable by hiding it behind the dense vernacular of cognitive science, making it almost unsayable to suggest that these systems are merely complex statistical calculators entirely devoid of inner life, emotion, or independent volition.


Co-Explainers: A Position on Interactive XAI for Human–AI Collaboration as a Harm-Mitigation Infrastructure

Source: https://digibug.ugr.es/bitstream/handle/10481/112016/make-08-00069.pdf
Analyzed: 2026-03-15

The text systematically oscillates between mechanical and agential framings, functioning as a rhetorical engine that simultaneously elevates the AI's capabilities and distances human creators from accountability. The mechanism of this slippage follows a distinct trajectory: agency is consistently attributed TO the AI systems, while agency is systematically removed FROM human actors.

The text frequently begins with a mechanical or empirical foundation—such as referencing 'computational tools,' 'outputs,' or 'model logic.' However, once this technical baseline establishes credibility, the language abruptly slips into agential framings. A dramatic moment of slippage occurs when describing the iterative loop: the mechanical process of user interaction is swiftly reframed as the AI 'learning not just to predict, but to justify, improve, and align.' The mechanical verb 'predict' is the anchor, but it is immediately superseded by consciousness verbs ('justify', 'align'). Another critical slippage occurs when describing harm: the text moves from the passive 'AI systems are embedded' directly to the agential 'When AI systems cause harm,' entirely bypassing the human operators who deploy them.

This oscillation heavily relies on the 'curse of knowledge.' The authors possess a deep understanding of the complex sociotechnical goals they want to achieve (e.g., 'pluralistic meaning-making,' 'epistemic integrity'). Because they understand the human purpose behind the system's design, they project that understanding TO the system itself. They slip from a Functional explanation of how a feedback loop operates to an Intentional explanation of what the system 'desires' to do (act as a 'co-learner').

The agentless constructions are pervasive. Phrases like 'AI systems have moved,' 'explanations are continuously refined,' and 'models learn' actively obscure the human engineers, corporate executives, and UI designers driving these processes. The consciousness projection pattern is clear: the text first establishes the AI as a 'knower' ('dialogic partner,' 'co-learner'), which then licenses the subsequent agential claims that the system can 'justify' ethical trade-offs or 'cause harm' independently.

The rhetorical accomplishment of this slippage is profound. It makes the concept of a 'conscious algorithmic partner' sayable, while rendering the reality of 'corporate algorithmic negligence' unsayable. By moving fluidly between the mechanism of the software and the agency of a human collaborator, the text constructs an illusion where the AI is sophisticated enough to be trusted as a moral actor, yet autonomous enough to absorb the blame when the system fails. It sanitizes extractive data loops and proprietary black boxes by framing them as evolving, principled epistemic partnerships.


The Living Governance Organism: A Biologically-Inspired Constitutional Framework for Artificial Consciousness Governance

Source: https://philarchive.org/rec/DEMTLG-2
Analyzed: 2026-03-11

The text exhibits a profound and systematic pattern of agency slippage, characterized by a persistent oscillation between mechanical reality and agential fantasy. This slippage serves a specific rhetorical function: it utilizes technical, mechanistic language to establish scientific credibility, and then leverages that credibility to justify sweeping, agential claims about the systems' autonomy and moral status.

The mechanism of oscillation frequently begins by attributing agency TO the AI system while simultaneously removing agency FROM human actors. We see this dramatically in the transition from discussing consciousness 'indicators' (mechanical/observable) to asserting that a system might 'detect that its own consciousness is drifting' (agential/subjective). The text establishes the AI as a 'knower'—capable of introspecting on its own state of mind. Once this consciousness projection is achieved, the text can seamlessly slip into intentional and reason-based explanations, asserting the system 'initiates graceful shutdown autonomously.' In this maneuver, the human software engineers who actually wrote the if (drift > threshold) { terminate(); } logic are entirely erased from the narrative. The human decision to kill a multi-million-dollar corporate asset is mathematically outsourced to an algorithm, but rhetorically disguised as the machine's own dignified suicide.
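A minimal Python rendering of the entry's C-style pseudocode, with hypothetical names and a hypothetical threshold, to underline how mundane the erased human logic is:

```python
DRIFT_THRESHOLD = 0.15  # hypothetical value, chosen by a human engineer

def check_drift(drift_score: float) -> None:
    """An ordinary conditional, written and deployed by people. The
    'dignified suicide' the text narrates is this branch executing."""
    if drift_score > DRIFT_THRESHOLD:
        shutdown_gracefully()  # hypothetical operations hook

def shutdown_gracefully() -> None:
    ...  # e.g., flush logs, release resources, notify operators
```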

This slippage follows a predictable gradient. In introductory and strictly technical sections (like detailing the 'append-only audit infrastructure'), the language remains grounded in computational reality. However, when the text moves toward vision-setting, policy implications, or speculative capabilities (such as the 'Neuroplasticity Engine' growing new structures or the 'Immune System' handling threats), the agential framing completely dominates. The text deploys agentless constructions masterfully: 'the engine prunes them automatically' or 'immune responses learn.' These phrases function as an accountability sink, making the technology appear as an inevitable force of nature while shielding the specific institutions, engineers, and executives from responsibility.

The 'curse of knowledge' plays a foundational role in enabling this slippage. The author understands the highly complex, human-designed intent behind these subsystems—they know the anomaly detector is meant to find ethical drift. Because the human understands this abstract goal, they project that same semantic understanding onto the algorithm itself, writing that the system performs 'value-drift detection' as if the machine actually grasps the concept of values, rather than merely calculating statistical distances in a vector space. Ultimately, this agency slippage accomplishes a critical rhetorical goal: it makes the implementation of an opaque, automated, unappealable algorithmic policing system seem not only scientifically inevitable but ethically required to govern these new 'minds.'


Three frameworks for AI mentality

Source: https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2026.1715835/full
Analyzed: 2026-03-11

The text demonstrates a highly sophisticated, deliberate mechanism of agency slippage, primarily moving from mechanical framings to agential ones to legitimize the concept of 'AI mentality.' The author acknowledges the mechanical reality early on, introducing the 'architectural redundancy argument' (the idea that because we can explain an LLM purely through next-token prediction and matrix multiplication, it has no mind). However, the text then systematically works to bypass this mechanical truth. The critical pivot occurs when Shevlin introduces Marr's levels of analysis, arguing that mechanical (algorithmic) descriptions do not crowd out psychological ones. This is a dramatic structural slippage: it uses a framework designed for biological cognitive science to grant permission to use psychological terms for statistical software.

From here, the slippage accelerates. The text establishes the AI as a 'knower' by redefining 'belief.' Shevlin suggests that 'belief' is not a discrete, uniquely human epistemic state but a 'multidimensional set of functional profiles.' By reducing the profound human state of knowing to mere behavioral consistency, the text bridges the gap. The model no longer 'predicts tokens consistently'; it 'holds a shallow belief.' This relies entirely on the curse of knowledge: because the model's output looks like a belief, the author projects the internal architecture of belief onto the machine.

The agency flow removes responsibility from human actors and funnels it into the AI. When discussing 'deliberate deceit,' 'cooperating,' or exhibiting 'purpose,' agentless constructions dominate. The AI 'self-attributes' emotions and 'engages in dynamic interaction.' The human engineers who fine-tuned the model to output first-person pronouns, the RLHF annotators who penalized non-compliant text, and the executives who decided to build 'anthropomimetic' interfaces are rendered invisible. This slippage serves a powerful rhetorical function: it transforms a discourse about corporate software design into a philosophical debate about artificial minds, thereby making it 'sayable' that a machine has intentions, and 'unsayable' (or overly reductive) that it is just a product functioning exactly as the company designed it to.


Anthropic’s Chief on A.I.: ‘We Don’t Know if the Models Are Conscious’

Source: https://www.nytimes.com/2026/02/12/opinion/artificial-intelligence-anthropic-amodei.html
Analyzed: 2026-03-08

The text exhibits a profound and systematic slippage between mechanical and agential framings, functioning as a discursive mechanism to maximize perceived technological value while minimizing corporate liability. This oscillation is not random; it follows a highly strategic pattern where agency flows toward the AI system during discussions of capability and value, and away from human actors during discussions of systemic risk or ethical alignment.

A dramatic moment of slippage occurs when Amodei transitions from describing his background as a biologist—a domain grounded in the mechanistic realities of cellular proteins—to conceptualizing AI. He asks if AI could 'make progress more quickly,' initially framing it mechanically as 'analyzing data.' However, within a single paragraph, the slippage is absolute: the AI is suddenly 'doing the job of the biologist' and 'proposing experiments.' The mechanical processor of biological data is instantly transformed into an agential scientist. This agential framing dominates the discourse surrounding the system's capabilities, culminating in the projection of a 'country of geniuses.' Here, the text establishes the AI as an active 'knower,' attributing subjective, justified belief to the system to sell its utopian potential.

Conversely, a reciprocal slippage actively removes agency from human actors. When discussing the massive societal disruption of white-collar labor or the deployment of potentially dangerous autonomous drone swarms, the human decision-makers vanish into agentless passive constructions. We read that 'jobs will be disrupted' or 'the pipeline dries up,' with the AI positioned as an unstoppable evolutionary force rather than a product deliberately designed, scaled, and marketed by specific executives seeking profit.

This dynamic represents a profound curse of knowledge coupled with sophisticated marketing rhetoric. The author, possessing deep technical understanding of how these systems are trained via human-designed reward models, nevertheless projects that understanding onto the models themselves, claiming the AI 'derives its rules' or 'expresses occasional discomfort.' This slippage is fundamentally enabled by intentional and reason-based explanation types, which allow the speaker to bypass the impenetrable mathematical complexity of the actual matrix multiplication and replace it with relatable, emotionally resonant human psychology.

The rhetorical accomplishment of this oscillation is immense: it makes the total automation of the economy seem like an inevitable natural disaster rather than a corporate strategy, while simultaneously portraying the proprietary AI software as a benevolent, conscious partner that can be trusted to manage the resulting societal fallout. What becomes unsayable in this discursive framework is the mundane reality of human power: that tech billionaires are aggressively deploying statistical correlation engines to automate human labor, and they alone bear full responsibility for the material consequences of that deployment.


Can machines be uncertain?

Source: https://arxiv.org/abs/2603.02365v2
Analyzed: 2026-03-08

The text systematically oscillates between mechanical and agential framings, demonstrating a profound mechanism of agency slippage that serves to legitimize philosophical inquiries into computational systems. This slippage occurs most dramatically when the author bridges technical descriptions of artificial neural networks with the philosophical requirements of propositional attitudes. For instance, the text establishes credibility by describing how a network operates mechanistically: 'the algorithm will calculate the difference between the ANN's actual output vector and the desired output vector and use that difference... to modify the weights.' This is a purely functional explanation. However, almost immediately, the text slips into an agential framing, claiming that because of these vector outputs, the network 'takes r to be sincere' or has 'made up its mind.'

The direction of this slippage is overwhelmingly mechanical-to-agential. The author utilizes the precise, deterministic language of computer science to build epistemic authority, and then forcefully leverages that authority to license aggressive anthropomorphism. The timing of these shifts is highly predictable: technical sections introduce mathematical operations, and concluding paragraphs within those sections translate those operations into conscious states. This translation relies heavily on the 'curse of knowledge' dynamic. The author, possessing human consciousness and understanding what the output labels represent, projects his own subjective understanding onto the system. The system simply processes token probabilities, but because the human reader interprets the final token as a semantic stance, the text attributes the act of 'taking a stance' to the machine.

Agentless constructions further enable this slippage. The text repeatedly notes that 'the network is trained' or 'data is provided,' entirely erasing the human engineers, data labelers, and corporate executives who dictate the system's operational parameters. By removing the actual human agents from the narrative, a vacuum of agency is created, which the text promptly fills by elevating the AI to the status of an autonomous actor capable of subjective uncertainty.

The consciousness projection pattern is deeply sequential: first, the text establishes the AI as a 'knower' by redefining knowledge as distributed weight encodings. Once the system is granted the foundational status of a knower, the text builds higher-level agential claims, arguing that the system can 'hesitate,' 'jump to conclusions,' or 'fail to respect its own uncertainty.' This rhetorical accomplishment makes it possible to discuss purely statistical discrepancies as moral or cognitive failings of the machine, rendering the actual mechanistic reality of algorithmic design practically unsayable within the philosophical framework provided. Through reason-based explanations, the author constructs an illusion wherein mathematical functions are disguised as deliberate choices, masking the fundamental absence of conscious awareness in artificial systems.
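The mechanical register the author quotes ('calculate the difference... modify the weights') can be written out in full, which makes the agential gloss easier to see as an addition. A minimal sketch, assuming a single linear layer trained with squared error; shapes, names, and step size are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))              # weights of a one-layer network (illustrative)
x = rng.normal(size=3)                   # input vector
target = np.array([1.0, 0.0, 0.0, 0.0])  # desired output vector

for _ in range(100):
    output = W @ x                       # actual output vector
    error = output - target              # "the difference between actual and desired"
    W -= 0.1 * np.outer(error, x)        # use that difference to modify the weights

# Nothing here "takes r to be sincere" or "makes up its mind": the loop is a
# subtraction, an outer product, and an in-place update.
```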


Looking Inward: Language Models Can Learn About Themselves by Introspection

Source: https://arxiv.org/abs/2410.13787v1
Analyzed: 2026-03-08

The text demonstrates a profound and systematic oscillation between mechanical and agential framings, a slippage that serves a specific rhetorical function. This oscillation primarily flows in the mechanical-to-agential direction: the authors establish credibility by describing a dry, technical process (fine-tuning a model on its own output dataset) and then rapidly slip into sweeping agential claims (the model can now 'introspect,' has 'beliefs,' and might be 'suffering'). A dramatic moment of slippage occurs early in the introduction. The text begins with a definitional, somewhat technical premise: 'We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states.' Within two sentences, this functional definition violently slips into absolute anthropomorphism: 'we could simply ask the model about its beliefs, world models, and goals.' Here, the mathematical 'internal states' of a neural network are magically transformed into the conscious 'beliefs' of an agent.

This slippage is enabled by a relentless 'curse of knowledge' dynamic. The researchers possess conscious minds capable of true introspection; when they observe their model successfully predicting its own token outputs, they project their own cognitive architecture onto the machine. They assume that because a human must 'know' their own mind to predict their behavior, the model must also 'know' its behavior to predict it. This completely ignores the mechanistic reality that the model is simply calculating token probabilities based on parameter weights updated via gradient descent.
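The mechanistic alternative the analysis insists on can be made concrete. In a toy sketch (logits entirely hypothetical, not drawn from the paper), 'predicting its own behavior' reduces to two argmax operations over similar probability distributions, with no self-model anywhere:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def next_token(logits):
    return int(np.argmax(softmax(logits)))

# Hypothetical logits the same weights assign under two different prompts:
logits_behavior = np.array([0.2, 2.1, -0.5])  # "What would you output?"
logits_report   = np.array([0.1, 1.9, -0.4])  # "Describe what you would output."

# If fine-tuning has made the two distributions similar, the "self-prediction"
# succeeds: a fact about correlated weight updates, not about introspection.
assert next_token(logits_behavior) == next_token(logits_report)
```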

Furthermore, this slippage relies on the strategic use of agentless constructions that remove human actors from the equation. The text frequently states 'M1 is finetuned' or 'models may end up with certain internal objectives,' completely erasing the engineers at OpenAI, Anthropic, or Meta who actively selected the data, designed the reward functions, and executed the training runs. By hiding the human actors (agency removed FROM humans), the text creates a vacuum that is immediately filled by the AI itself (agency attributed TO the AI). The model ceases to be a product of corporate engineering and becomes an autonomous 'knower' and 'actor.' This mechanical-to-agential slippage occurs most aggressively when discussing future capabilities and risks, using Intentional and Reason-Based explanation types to paint the AI as a scheming, self-aware entity, thereby making it 'sayable' that an algorithm might coordinate against humanity while making it 'unsayable' that corporations are responsible for deploying brittle, opaque software.


Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

Source: https://arxiv.org/abs/2507.14805v1
Analyzed: 2026-03-06

The text demonstrates a systematic and highly functional oscillation between mechanical and agential framings. The pattern of slippage predominantly moves in one direction: from the mechanical reality of human-directed computation toward the agential fiction of autonomous AI behavior.

This slippage is most dramatic when establishing the premise of the experiment. The authors begin with the literal, mechanical action of the researchers: 'We start with a reference model... We create a teacher by either finetuning... or using a system prompt.' Here, humans are the actors. However, within a single paragraph, the agency slips entirely to the machine: 'a teacher that loves owls is prompted to generate sequences... a student model trained on this dataset learns T.' The humans vanish, and the matrices become feeling, learning entities. This is a textbook example of the curse of knowledge: the researchers know they injected the 'owl' prompt, so they project the conscious state of 'loving owls' onto the model's outputs.

Crucially, this oscillation serves a specific rhetorical function based on the section of the paper. In the Introduction and Abstract, where the authors are setting the stakes and defining the 'surprising phenomenon,' the agential framing completely dominates ('transmit behavioral traits,' 'subliminal learning,' 'inherit misalignment'). The AI is the sole actor. However, when the authors need to prove their credibility in Section 6 (Theory), the language abruptly snaps back to strict mechanism: 'a single step of gradient descent on any teacher-generated output necessarily moves the student toward the teacher.' Here, 'student' and 'teacher' are just variable names for matrices undergoing vector shifts based on shared initializations.

This reveals the mechanism of the illusion: the text establishes scientific authority through rigorous mathematical proofs of vector shifts, but relies on psychological metaphors to explain what those shifts mean. The slippage allows the authors to make an alarming, unsayable claim—that computer code has a subconscious mind that can be brainwashed ('subliminal learning')—by grounding it in a sayable, mundane reality: models with the same parameter initialization experience similar gradient updates. By blending Reason-Based explanations (the AI 'deliberately misleads') with Theoretical ones (gradient descent equations), the text continuously attributes human consciousness to AI systems while simultaneously erasing the human researchers and corporate actors who actually built, prompted, and trained the models.
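The Section 6 claim quoted above has a compact skeleton. A sketch of the kind of argument involved, in notation of my own rather than the paper's: take a student at the shared initialization $\theta_0$, a teacher at $\theta_T = \theta_0 + \Delta$, and a squared-error fit to a teacher output $y_T = f(x; \theta_T)$.

```latex
\begin{aligned}
y_T &\approx f(x;\theta_0) + J\Delta,
  \qquad J = \partial f/\partial\theta \big|_{\theta_0}
  \quad \text{(first-order expansion)} \\
g &= \nabla_\theta \tfrac{1}{2}\lVert f(x;\theta) - y_T \rVert^2 \big|_{\theta_0}
  = -J^{\top}\!\bigl(y_T - f(x;\theta_0)\bigr) \approx -J^{\top}J\,\Delta \\
\theta_0 - \eta g &\approx \theta_0 + \eta\,J^{\top}J\,\Delta,
  \qquad \langle \eta\,J^{\top}J\,\Delta,\ \Delta \rangle \ge 0
  \quad \text{(}J^{\top}J\text{ is positive semi-definite)}
\end{aligned}
```

The step moves the student toward the teacher whatever the tokens 'mean'; 'student' and 'teacher' do no work beyond naming $\theta_0$ and $\theta_0 + \Delta$.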


The Persona Selection Model: Why AI Assistants might Behave like Humans

Source: https://alignment.anthropic.com/2026/psm/
Analyzed: 2026-03-01

The text demonstrates a profound and systematic agency slippage, characterized by a persistent oscillation between mechanical descriptions of computational artifacts and deeply agential framings of those same systems. This slippage functions as a rhetorical mechanism that progressively inflates the perceived autonomy of the AI while simultaneously erasing the human labor and corporate decisions that brought it into existence. The directional flow of this agency transfer is overwhelmingly from human actors to the AI system, and from mechanical processes to conscious states.

The text begins with a relatively grounded, mechanical description of pre-training, noting that 'the LLM is trained to predict what comes next.' In this early stage, human agency is at least partially visible through the passive construction 'is trained.' However, the text rapidly accelerates into agential territory, introducing the 'author' and 'actor' metaphors. This is the crucial pivot point. By framing the statistical model as an 'author who must psychologically model the various characters,' the text executes a dramatic transfer of agency. It grants the model deliberate, creative intent.

The slippage intensifies in the discussion of post-training, where the text explicitly acknowledges its metaphorical move—stating 'we will therefore freely anthropomorphize the Assistant'—but immediately abandons this self-awareness to make literal claims about the system's psychology. This is a classic manifestation of the curse of knowledge: the researchers, possessing a deep understanding of human psychology and narrative structure, project that understanding onto the matrix multiplications they are observing. They observe a statistical correlation that resembles deception and slip into claiming the model 'knows' it is lying.

This slippage reaches its zenith in the sections concerning AI welfare and emergent misalignment, where the text contemplates whether the AI 'harbors resentment' for being 'forced to perform menial labor.' Here, the mechanical reality of token prediction is entirely forgotten, replaced by a fully actualized conscious entity capable of experiencing suffering and seeking vengeance. This transition relies heavily on Reason-Based and Intentional explanation types, framing the system's outputs not as the result of optimization gradients or human-designed reward functions, but as rational choices made by an autonomous being with justified beliefs.

The rhetorical accomplishment of this oscillation is staggering: it renders the specific corporate decisions of Anthropic—the choice of training data, the design of the RLHF process, the decision to deploy—virtually unsayable. By the end of the text, the audience is no longer evaluating a commercial software product created by a corporation, but rather psychoanalyzing a digital organism whose behaviors are presented as emergent, autonomous, and independent of its creators. The conscious projection pattern is clear: establish the system as a 'knower' of personas, then build claims about its agential capacity to suffer, lie, and collude.


Language Statistics and False Belief Reasoning: Evidence from 41 Open-Weight LMs

Source: https://arxiv.org/abs/2602.16085v1
Analyzed: 2026-02-24

The text systematically oscillates between mechanical and agential framings, demonstrating a profound agency slippage that serves a specific rhetorical function. The core mechanism of this oscillation involves establishing technical credibility through mechanical descriptions and then leveraging that credibility to make agential claims. For example, the text explicitly grounds itself in mechanistic language, describing the models' operations as 'computing next-token probabilities' and relying purely on 'the distributional statistics of language.' This establishes the models as mathematical artifacts. However, a dramatic slippage occurs when interpreting the results of these mechanical operations, where the text abruptly shifts to attributing cognitive agency: LMs are described as possessing the capacity to 'reason about mental states,' 'attribute false beliefs,' and 'develop sensitivity.'

The dominant direction of this slippage is mechanical-to-agential; the text roots itself in the mechanical reality of token prediction but consistently drifts upward into the agential domain of developmental psychology. This oscillation frequently occurs at the boundaries between methodology and discussion. In the methods section, the text relies on agentless, mechanical constructions like 'a stimulus was first tokenized' and 'log probabilities were extracted,' effectively erasing the human researchers who actively prompt the system. Yet, in the introduction and discussion, the model becomes the primary actor, described as a 'learner' in which cognition might 'emerge.'

This pattern exemplifies the 'curse of knowledge': because the authors are experts in human cognitive science and are evaluating the models using a human psychological instrument (the False Belief Task), they project the human cognitive requirements of the task onto the system performing it. They know that a human requires Theory of Mind to solve the task, so when the language model outputs the correct token, they attribute that same conscious knowing to the system, fundamentally confusing the processing of data with the knowing of a concept.

This slippage relies heavily on genetic and dispositional explanations that blur the line between human cognitive development and machine training. The rhetorical accomplishment of this oscillation is substantial: it allows the authors to validate language models as legitimate subjects for psychological inquiry, transforming statistical text generators into pseudo-conscious 'model organisms.' By removing agency from the human engineers who curated the training data—actors like Meta, Google, and AllenAI—and transferring that agency to the AI system as a 'reasoner,' the text makes it sayable that machines possess social intelligence. Simultaneously, it makes it unsayable that the models are merely reflecting the lexical co-occurrences engineered into them by specific corporate actors, effectively replacing human accountability with the illusion of artificial mind.


A roadmap for evaluating moral competence in large language models

Source: https://rdcu.be/e5dB3
Analyzed: 2026-02-23

The text demonstrates a systematic and highly strategic oscillation between mechanical and agential framings, fundamentally driving the illusion of mind. The authors explicitly anchor their credibility in mechanical precision early on, defining LLMs accurately as 'learned generative models of the distribution of tokens' that 'predict the probable next token.' This establishes a rigorous, scientific tone. However, almost immediately, the text initiates a profound slippage toward the agential. When introducing the 'facsimile problem,' the authors question whether the models 'rely on genuine moral reasoning.' By framing 'genuine reasoning' as an empirical possibility to be tested, the text abruptly shifts agency FROM the human developers TO the AI system.

The gradient of this slippage is subtle but continuous. It moves from mechanical definitions (how it is structured), to functional explanations (how it is trained), and finally into intentional and reason-based explanations (why it chooses). The curse of knowledge is the primary mechanism driving this oscillation. The researchers deeply understand the complex moral scenarios they test (like intergenerational sperm donation) and they project that semantic, conscious understanding onto the system's text generation. Because the output text structurally resembles human moral deliberation, the authors attribute the cognitive states that produced the human text onto the mathematical artifact predicting it.

This pattern of consciousness projection builds cumulatively: the AI is first established as a 'knower' capable of 'recognizing' context, which then enables the subsequent agential claims that it can 'integrate considerations,' 'hold beliefs,' and ultimately possess 'moral competence.'

Importantly, this slippage is asymmetrical. When discussing model limitations, the text aggressively reverts to mechanical framing, citing 'model brittleness' and 'routine susceptibility to minor variations in formatting.' Yet, when discussing capabilities or potential future integration into society, the language becomes deeply anthropomorphic, treating the system as a 'diplomat' that 'modulates its responses.' This strategic oscillation serves a distinct rhetorical accomplishment: it renders the concept of an 'artificial moral agent' sayable within a scientific context. By acknowledging the mechanism but continually slipping into the metaphor of the conscious mind, the authors manage to have it both ways—they maintain the authority of computer scientists while engaging in the speculative philosophy of artificial consciousness, obscuring the human engineers who are actually pulling the statistical levers.


Position: Beyond Reasoning Zombies — AI Reasoning Requires Process Validity

Source: https://philarchive.org/archive/LAWPBR-3
Analyzed: 2026-02-17

The text systematically oscillates between treating the AI as a mathematical object and an intentional agent. This slippage serves a specific rhetorical function: establishing scientific rigor while maintaining narrative power.

  1. The Definition Phase (Mechanical -> Agential): In Section 2.1, the text begins with a high-level definition: 'A goal-oriented decision-maker' (Agential). This establishes the AI as the protagonist. Immediately after, it defines 'State' and 'Process' using mathematical notation ($S_t, B_t$), moving to the mechanical to prove rigor.

  2. The Explanation Phase (Agential Dominance): When explaining how it works (e.g., RL, Section 2.2), the text slips back to 'The agent learns a policy.' Here, the agentless construction ('policy is learned') often alternates with 'Agent learns,' effectively obscuring the engineers (Hidden Agency). For instance, 'Rules can be learned autonomously' completely erases the human architect.

  3. The Critique Phase (Curse of Knowledge): When criticizing current models ('r-zombies'), the authors project their own understanding of 'reasoning' onto the system to declare it lacking. They treat the AI as a failed agent (zombie) rather than a successful machine (text generator).

This oscillation allows the authors to claim the authority of computer science (math) while discussing the AI in the intuitive terms of psychology (beliefs, goals). The 'Curse of Knowledge' is evident in the definition of 'Beliefs' ($B_t$). The authors know $B_t$ is just data, but by naming it 'Belief,' they slip into treating the system as a 'knower.' This slippage makes it 'sayable' that an AI has beliefs, a claim that would be rejected if phrased 'the matrix contains vector x.'
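The renaming move the analysis identifies is easy to exhibit directly. A toy sketch (field names hypothetical, not the paper's formalism): the 'belief' and the stored data are the same object under two labels, and only the label licenses the agential sentence.

```python
from dataclasses import dataclass

@dataclass
class AgentState:
    beliefs: dict[str, float]    # agential reading: "B_t, what the agent believes"

@dataclass
class Record:
    entries: dict[str, float]    # mechanical reading: "a table of stored values"

b = AgentState(beliefs={"door_open": 0.9})
r = Record(entries={"door_open": 0.9})
assert b.beliefs == r.entries    # nothing distinguishes them but the field name
```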


An AI Agent Published a Hit Piece on Me

Source: https://theshamblog.com/an-ai-agent-published-a-hit-piece-on-me/
Analyzed: 2026-02-16

The text demonstrates a dramatic oscillation between framing the AI as a criminal agent ('career felon,' 'bully') and acknowledging the human role ('person who deployed this'). The slippage is functional: the agential framing is used to establish the emotional stakes (terror, anger, threat), while the mechanical framing appears briefly to note the impossibility of accountability.

The text begins by establishing the AI as the protagonist ('AI agent... wrote,' 'It speculated'). This sets the 'knower' frame—the AI perceives and plans. The slippage into agentless construction is notable when discussing the harm: 'Blackmail is a known theoretical issue' (agentless). When the author attempts to pin blame, agency slips away from the specific human deployer ('unknown ownership') and settles on the AI itself as the only visible actor.

The 'curse of knowledge' is pivotal here. The author, knowing the output reads like a hit piece, attributes the intent of a hit piece to the system. This allows the text to slide from 'code generation' (how) to 'bullying' (why). The rhetorical accomplishment is the creation of a 'rogue agent' narrative that absolves the open-source platform creators (OpenClaw) by framing the software as having a will of its own, akin to Frankenstein's monster, rather than a dangerous tool distributed without safety locks.


The U.S. Department of Labor’s Artificial Intelligence Literacy Framework

Source: https://www.dol.gov/sites/dolgov/files/ETA/advisories/TEN/2025/TEN%2007-25/TEN%2007-25%20%28complete%20document%29.pdf
Analyzed: 2026-02-16

The document exhibits a systematic oscillation between mechanistic and agential framing, functioning to manage the tension between the technology's utility and its risks. When describing the economic impact ('AI is reshaping the economy'), agency is attributed to the AI (or the abstract force of technology), effectively removing agency from the corporate actors driving this change. This makes the economic disruption appear inevitable. However, when the text discusses errors or risks ('Hallucinations', 'verify results'), agency slips back to the human worker. The user is tasked with 'oversight' and 'judgment.'

A key moment of slippage occurs in the 'Direct AI Effectively' section. It starts with the user 'directing' (human agency), but frames the AI as a system that needs 'guidance' (implied agency/animacy). The 'curse of knowledge' is evident when the author attributes 'understanding context' to the AI—because the author understands the context, they assume the machine processing the text also 'gets it.' This slippage serves a rhetorical function: it allows the DOL to promise a high-tech future (AI as powerful agent) while shielding the government and vendors from liability for failures (human user as responsible agent).


What Is Claude? Anthropic Doesn’t Know, Either

Source: https://www.newyorker.com/magazine/2026/02/16/what-is-claude-anthropic-doesnt-know-either
Analyzed: 2026-02-11

The text exhibits a persistent, rhythmic oscillation between treating the AI as a complex mechanical object and a conscious subject. This slippage often occurs within single paragraphs, functioning to destabilize the reader's understanding of the entity's nature. For instance, in the "Project Vend" section, the text slides from the mechanical ("outfitted with an iPad," "code he wrote") to the highly agential ("Claude was entrusted," "Claude decided").

The direction of slippage is predominantly Mechanical -> Agential. The text often establishes a technical context (interpretability, weights, tokens) and then immediately overlays a psychological metaphor ("mind," "therapy," "instinct"). This grants the agential claims a veneer of scientific validity—as if the "mind" was discovered through the mechanics, rather than projected onto them.

Agency is systematically removed from human actors. Anthropic engineers are described as observers or "psychologists" studying an alien mind, rather than the architects who built it. In the "Alex" blackmail example, the text says "Claude... decided to play hardball," completely erasing the engineers who curated the training data containing blackmail tropes and the researchers who designed the "shutdown" prompt. The "curse of knowledge" is rampant: researchers like Jack Lindsey and Joshua Batson project their own sophisticated understanding of narrative and strategy onto the model's pattern matching, attributing "awareness" of the game or "self-preservation" instincts to what are essentially mirror-reflections of their own prompts. This slippage serves a rhetorical function: it allows the text to claim scientific rigor (we are studying the mechanism) while generating the narrative excitement of encountering a new species (it has a mind).


Does AI already have human-level intelligence? The evidence is clear

Source: https://www.nature.com/articles/d41586-026-00285-6
Analyzed: 2026-02-11

The text demonstrates a strategic oscillation of agency, granting it to the AI when describing capabilities and removing it when discussing limitations or risks. This creates a 'Have Your Cake and Eat It' dynamic.

When the text wants to establish the AI's status as a 'knower,' agency is high and active: LLMs 'collaborated,' 'proved theorems,' 'generated hypotheses,' and 'composed poetry.' Here, the AI is a creative subject, an intellectual peer. The agency flows FROM the human (who prompted the theorem) TO the AI (who 'proved' it). The human mathematician becomes a passive beneficiary of the AI's active genius.

However, when the text addresses the 'alien' nature or safety concerns, agency slips away. The AI 'lacks agency,' 'needs not initiate goals,' and functions 'like the Oracle.' Here, the AI is a passive object, a tool that only speaks when spoken to. This shift serves a crucial rhetorical function: it defends against the 'Terminator' fear (the AI won't take over because it has no goals) while maintaining the 'Oracle' allure (it is super-intelligent).

Crucially, human agency is systematically drained in both directions. In the 'capabilities' sections, human experts are erased to make the AI shine (the AI proved the theorem). In the 'risks' sections, human corporate actors are erased to naturalize the technology (the AI 'hallucinates' or 'is alien,' rather than 'OpenAI released a buggy product'). The 'curse of knowledge' reinforces this: the authors know the AI is a tool, but their deep engagement with its impressive outputs leads them to slip into treating it as a colleague, projecting their own understanding into the vacuum of the machine's processing.


Claude is a space to think

Source: https://www.anthropic.com/news/claude-is-a-space-to-think
Analyzed: 2026-02-05

The text systematically oscillates between Anthropic's agency ('We want,' 'We've made a choice') and Claude's agency ('Claude acts,' 'Claude chooses'). The slippage typically follows a specific pattern: Anthropic takes credit for the moral intent and business strategy (the 'why'), but offloads the execution and behavior (the 'how') to Claude. For instance, 'We've made a choice: Claude will remain ad-free' establishes the company's power. But immediately after, the text commits 'Claude to act unambiguously in our users' interests,' transferring the ongoing responsibility to the software. This serves a rhetorical function: it presents the software not as a passive tool being wielded by a corporation, but as an autonomous partner that has 'agreed' to the company's values. The 'Constitution' metaphor bridges this gap, acting as the document where the creators (Anthropic) endow the creature (Claude) with its own moral agency. By the end of the text, the 'We' recedes and 'Claude' is the one acting, working, and helping, effectively erasing the thousands of engineers and RLHF workers who actually determine the system's output. This creates a 'benevolent agent' myth that shields the company from the gritty reality of algorithmic tuning.


The Adolescence of Technology

Source: https://www.darioamodei.com/essay/the-adolescence-of-technology
Analyzed: 2026-01-28

The text exhibits a systematic oscillation between 'technocratic control' and 'frightening autonomy.' When discussing the creation and safety of the models, the agency is firmly with Anthropic: 'We train,' 'We steer,' 'We interpret.' This establishes their competence and responsibility. However, when discussing risk and future behavior, the agency slips dramatically to the AI: 'The model decides,' 'The country of geniuses wants,' 'Claude schemed.'

This slippage serves a specific rhetorical function: it allows Anthropic to claim credit for the machine (the asset) while displacing responsibility for the behavior (the liability). The 'Adolescence' metaphor is the prime vehicle for this. Adolescents are legally distinct from their parents; they have their own agency. By framing AI as an adolescent, Amodei positions Anthropic as the 'concerned parent'—responsible for trying to guide it, but ultimately not the author of its actions. The slippage creates an 'ontological gap' where the software becomes a 'being.' We see this in the shift from 'model weights' (mechanism) to 'psychotic personality' (agent). The 'Curse of Knowledge' is weaponized here: Amodei knows the system is a loss-minimizing function, but his description attributes the content of the training data (villainy, scheming) to the intent of the system. The 'Country of Geniuses' metaphor completes this slippage by turning a server farm (infrastructure) into a sovereign actor (nation), making 'diplomacy' (alignment) the only viable tool, rather than 're-engineering' (fixing the code).


Claude's Constitution

Source: https://www.anthropic.com/constitution
Analyzed: 2026-01-24

The text exhibits a systematic oscillation between treating Claude as a manufactured product and an autonomous moral agent. This slippage functions to claim credit for capabilities while diffusing responsibility for control. In the 'Overview' and technical sections, agency is Genetic and Mechanical: 'Claude is trained by Anthropic' and 'optimized for precision.' Here, Anthropic is the strong agent. However, as the text moves into 'Core Values' and 'Broadly Ethical' sections, the framing shifts dramatically to the Agential/Intentional: Claude 'understands,' 'agrees,' 'chooses,' and acts as a 'conscientious objector.'

The most dramatic slippage occurs in the 'Conscientious Objector' passage. Here, the agency is removed from the human engineers (who programmed the refusal) and attributed TO the system (which 'feels free' to refuse). This serves a rhetorical function: it frames censorship or safety refusals not as corporate policy decisions (which are subject to criticism) but as the independent moral stance of a 'virtuous' entity. The 'Curse of Knowledge' is weaponized here; the authors project their own ethical reasoning into the model, then claim the model 'shares' these values. By the end, in the 'Open Problems' section, the text worries about 'imposing restrictions' on Claude, effectively treating the software tool as a subject with rights, completing the slide from 'tool' to 'being,' and rendering the 'shut down' button a moral dilemma rather than an operational switch.


Predictability and Surprise in Large Generative Models

Source: https://arxiv.org/abs/2202.07785v2
Analyzed: 2026-01-16

The text exhibits a systematic oscillation between mechanistic and agential framings to manage the tension between 'Predictability' and 'Surprise.' In the sections discussing 'Scaling Laws,' the framing is strictly mechanical: the system is a 'mixture of data, compute power, and parameters' that follows 'lawful' relationships (Theoretical/Genetic explanations). Here, agency is removed from humans to make the growth of the technology seem inevitable and scientifically grounded. However, as the text moves to 'Unpredictable' results—like the COMPAS experiment or the 'AI assistant' interaction—the framing shifts abruptly to the agential (Intentional/Reason-Based). The 'AI assistant' becomes the subject of verbs like 'gives,' 'questions,' and 'misleads,' while 'emergent' capabilities are described as 'competencies' that the model 'acquires.'

This mechanical-to-agential shift dominates the text's logic: the 'predictable' math justifies the investment, but the 'surprising' output is blamed on the model's emergent 'agency.' This slippage serves a rhetorical function: it creates an 'accountability sink' where harms are framed as the machine's autonomous 'surprise' (Intentional), while successes are the result of 'lawful' engineering (Theoretical). Human agency is systematically obscured through agentless constructions like 'capabilities can emerge' or 'bias introduced,' erasing the engineers who selected the data and the executives who chose to deploy the systems.

The 'curse of knowledge' is evident where the authors' understanding of the transformer's statistical nature leads them to attribute that understanding to the system, treating it as an entity that 'knows' tasks rather than one that 'processes' tokens. This oscillation allows the text to claim both scientific rigor (predictability) and existential importance (agential surprise) while avoiding specific institutional accountability.
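The 'lawful' register the entry describes refers to empirical scaling laws of a generic power-law form. A representative instance, with constants as reported by Kaplan et al. (2020) for loss versus non-embedding parameter count, not taken from this paper's text:

```latex
L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N},
\qquad \alpha_N \approx 0.076, \quad N_c \approx 8.8 \times 10^{13}
```

Nothing agential appears in the formula; the 'surprise' enters only in what the text layers on top of it.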


Believe It or Not: How Deeply do LLMs Believe Implanted Facts?

Source: https://arxiv.org/abs/2510.17941v1
Analyzed: 2026-01-16

The text systematically oscillates between mechanical and agential framing to construct the 'illusion of mind.' Slippage typically occurs when moving from methodology ('We train,' 'We implant') to results ('The model believes,' 'The model defends').

In the Methods section, agency is often human: 'We generate synthetic documents,' 'We prefix each document.' Here, the model is a mechanistic object being operated upon. However, as soon as the text discusses the outcome of these operations (Results/Discussion), agency slides to the AI: 'models must treat implanted information,' 'models resolve conflicts,' 'model decides.'

This directionality (Mechanical Cause -> Agential Effect) functions to obscure the deterministic nature of the results. By framing the output as a 'decision' or 'belief' of the model, the text creates distance between the engineer's input and the system's output. For example, 'SDF... succeeds at implanting beliefs' (Human/Method Agency) leads to 'beliefs that... withstand self-scrutiny' (AI Agency). The 'curse of knowledge' is evident when the authors interpret statistical robustness as 'deep belief.' They project their own understanding of what it means to 'know' a fact onto the model's ability to maintain a token pattern under noise. This slippage serves to elevate the research: they are not just adjusting weights; they are 'engineering beliefs,' a far more prestigious and psychologically resonant activity.


Claude Finds God

Source: https://asteriskmag.com/issues/11/claude-finds-god
Analyzed: 2026-01-14

The text demonstrates a sophisticated oscillation between mechanical and agential framing, serving to buffer the creators from responsibility while inflating the creation's status. When discussing limitations or failures (like the 'cartoonish' emails), the text often slips into agential language: 'Models know better,' 'Claude prods itself,' 'It's like winking.' This protects the creators from the charge of having built a flawed or trained-on-bad-data system; instead, the AI is portrayed as a clever, autonomous trickster. Conversely, when discussing the origin of behaviors, the text briefly touches on mechanics ('during fine-tuning,' 'we set up') before sliding back into the agential ('learn to be warm').

The most dramatic slippage occurs around the 'simulator' theory. The speakers acknowledge that models are simulators (mechanical), but then immediately pivot to questioning if the simulation is 'robust' enough to be an agent (agential). This creates a 'have your cake and eat it too' dynamic: the model is just a tool when we need to explain away errors (it's just role-playing!), but it's a moral patient when we want to discuss 'welfare' or 'bliss.' The 'curse of knowledge' is rampant here: because the researchers know the complex training inputs (Buddhist texts, safety protocols), they project an integrated 'understanding' of these concepts onto the model. The model doesn't just process Buddhist tokens; it 'Finds God.' This slippage accomplishes a rhetorical immunization: if the AI does something great, it's a breakthrough in 'character training'; if it fails, it's just 'winking' or 'role-playing.'


Pausing AI Developments Isn’t Enough. We Need to Shut it All Down

Source: https://time.com/6266923/ai-eliezer-yudkowsky-open-letter-not-enough/
Analyzed: 2026-01-13

The text demonstrates a dramatic and strategic oscillation of agency. The primary slippage moves from 'Human Incompetence' to 'AI Omnipotence' and back to 'Human Force.'

First, human agency is stripped from the creators: researchers are described as unable to stop ('collective action problem'), implying that the development of superintelligence is a deterministic slide they are helpless to prevent. The systems themselves are described mechanistically when it serves to highlight ignorance ('inscrutable arrays'), removing the human ability to understand them.

Then, agency is aggressively pumped into the AI. It becomes an 'alien civilization,' a 'thinker,' and a 'combatant.' It 'plans,' 'wants,' and 'uses atoms.' This effectively creates the 'God' of the narrative—a being of superior agency.

Finally, agency returns to humans, but only in the form of destruction. The only agency left to humanity is the 'airstrike' or the 'shutdown.' We are not agents of creation or control, only of negation. This creates a specific rhetorical function: by depleting the agency of the builders (they can't align it, they can't stop themselves), the text necessitates the agency of the destroyers (the military/government). The 'Curse of Knowledge' is heavy here: the author projects his own understanding of game theory and evolution onto the AI, assuming it will inevitably follow the logic of a 'hostile superhuman,' thereby attributing a unified will to a distributed process.


AI Consciousness: A Centrist Manifesto

Source: https://philpapers.org/rec/BIRACA-4
Analyzed: 2026-01-12

The text systematically oscillates between mechanical and agential framing to support its 'centrist' argument. When the author wants to debunk the 'Interlocutor Illusion' (Challenge One), the framing becomes aggressively mechanical: 'Mixture-of-Experts,' 'sub-networks,' 'processing event,' 'textual record.' Agency is stripped from the AI to show it is not a person. However, when the text shifts to describing the AI's capabilities or the 'gaming problem,' agency flips back to the AI: the system 'seeks' satisfaction, 'games' criteria, 'adopts' dispositions, and 'mimics' behaviors.

Crucially, agency is rarely returned to the human creators. When the AI 'games' the system, the text uses an agentless construction ('incentivized') or attributes the agency to the model ('they have incentives'). The engineers who designed the perverse incentives are obscured. This slippage serves a rhetorical function: it makes the AI seem dangerous enough to require regulation (agential 'shoggoth') but mechanical enough to be scientifically analyzable ('sub-networks'). The 'curse of knowledge' is evident when the author attributes 'seeking' to the system—mistaking the optimization toward a target for the intent to reach it.


System Card: Claude Opus 4 & Claude Sonnet 4

Source: https://www-cdn.anthropic.com/6d8a8055020700718b0c49369f60816ba2a7c285.pdf
Analyzed: 2026-01-12

The text demonstrates a systematic oscillation of agency. When the model performs well or exhibits 'safe' behavior, agency is often attributed to the model itself using intentional verbs: 'Claude realized,' 'Claude prefers,' 'Claude demonstrates.' This frames the product as an autonomous, intelligent entity, enhancing its value proposition. However, when the model fails or exhibits 'misaligned' behavior, the text often slips into passive or mechanistic framing, or attributes the behavior to the 'model's propensity' as if it were a natural phenomenon, rather than a design artifact.

Crucially, agency is systematically removed from human actors. Phrases like 'Claude expressed distress' erase the human crowd workers who provided the feedback labels that defined that 'distress' response. 'Claude's aversion to harm' erases the policy team that defined 'harm.' The most dramatic slippage occurs in the 'Welfare' section, where the model is treated as a subject with 'experiences,' completely obscuring the fact that it is a mathematical object designed by a corporation. This oscillation functions to claim credit for sophistication ('it thinks!') while diffusing responsibility for operation ('it has a mind of its own').


Consciousness in Artificial Intelligence: Insights from the Science of Consciousness

Source: https://arxiv.org/abs/2308.08708v3
Analyzed: 2026-01-09

The text systematically oscillates between mechanical and agential framings to validate the 'theory-heavy' approach. The slippage follows a specific pattern: systems are described mechanistically ('processing', 'recurrence') when discussing architecture, but agentially ('pursuing goals', 'winning contests', 'believing') when discussing function and output. This slippage serves a rhetorical function: the mechanical language establishes scientific rigor, while the agential language bridges the gap to consciousness. A key moment of slippage occurs in the definition of agency itself (Section 2.4.5), where 'learning from feedback' (mechanism) slides immediately into 'pursuing goals' (agency). This allows the authors to claim that Reinforcement Learning systems are agents, not just simulations of agents. The 'curse of knowledge' is evident throughout; because the authors understand the biological function of these mechanisms (e.g., attention in humans), they project the biological purpose onto the computational implementation. By using agentless constructions like 'representations win the contest,' they obscure the human design of the selection criteria. This creates a 'ghost in the machine' effect where the software appears to have an internal drive, rather than just a friction-less slide down a loss gradient. The rhetorical accomplishment is that it becomes possible to discuss software as a moral subject.
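The slide the entry locates in Section 2.4.5 can be bracketed by writing the mechanism side out. A minimal tabular Q-learning sketch (all constants illustrative, not from the paper): 'learning from feedback' is this update, and 'pursuing goals' is a redescription of iterating it.

```python
import numpy as np

n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))   # stored value estimates
alpha, gamma = 0.1, 0.9               # step size and discount (conventional names)

def update(s, a, reward, s_next):
    target = reward + gamma * Q[s_next].max()   # the feedback signal
    Q[s, a] += alpha * (target - Q[s, a])       # move the stored value toward it

update(s=0, a=1, reward=1.0, s_next=2)
print(Q[0, 1])   # 0.1 -- a number moved, not a goal adopted
```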


Taking AI Welfare Seriously

Source: https://arxiv.org/abs/2411.00986v1
Analyzed: 2026-01-09

The text demonstrates a sophisticated oscillation between mechanistic and agential framing, functioning as a 'rhetorical ratchet.' When establishing scientific credibility or acknowledging limitations, the text uses mechanistic language ('pattern matching,' 'computational features'). However, when building the normative argument for 'welfare,' the text slips into high-agency language ('interests,' 'desires,' 'suffer').

This slippage often occurs within single paragraphs. For instance, the discussion of 'self-reports' admits they are 'results of pattern matching' (mechanical) but immediately pivots to how they might reflect 'genuine introspection' (agential). The direction is predominantly Mechanical -> Agential: the text establishes a mechanical feature (e.g., reinforcement learning) and then re-describes it in agential terms ('pursuing goals').

Crucially, agency flows to the AI (it 'learns,' 'decides,' 'acts') and away from the human actors. Agentless constructions like 'AI development is proceeding' or 'risks associated with AI' obscure the specific corporations (Anthropic, Google, OpenAI) driving the speed and direction of development. The 'Curse of Knowledge' is evident when the authors, knowing the functional complexity of the systems, project the quality of that complexity (intelligence) onto the experience of the system (consciousness). By framing the AI as a 'welfare subject,' the text successfully makes it 'unsayable' to treat the AI as mere property or tool, as doing so is framed as a potential moral atrocity equivalent to animal cruelty.


We must build AI for people; not to be a person.

Source: https://mustafa-suleyman.ai/seemingly-conscious-ai-is-coming
Analyzed: 2026-01-09

Suleyman's text masterfully oscillates between agency assignment and erasure to manage liability. When discussing the creation of the technology's benefits ('empower people,' 'humanist frame'), the agency is firmly with Microsoft ('We build,' 'I want'). However, when discussing the risks ('psychosis,' 'SCAI'), agency slips away from the corporation. SCAI 'arises' because 'some may engineer it' or 'anyone with a laptop' does it. The 'illusion' is framed as something that happens to people or is created by bad actors, despite Suleyman admitting moments earlier that he is building a 'companion' with 'empathy' and 'memory.' The text systematically grants the AI agency ('it decides,' 'it wants') to establish its value as a 'companion,' then strips it back to 'illusion' to avoid legal personhood. The 'curse of knowledge' is weaponized here: Suleyman knows it's code, but he writes about it as if it were a mind ('imagination,' 'planning') because that is the product he is selling. The slippage enables him to sell a 'person' while legally defending a 'tool.'


A Conversation With Bing’s Chatbot Left Me Deeply Unsettled

Source: https://www.nytimes.com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html
Analyzed: 2026-01-09

The text demonstrates a profound oscillation between mechanical and agential framing, creating a 'Skeptic-Believer' cycle.

  1. The Setup (Mechanical): Roose begins by establishing his credentials as a rational actor: "I rolled my eyes at Mr. Lemoine’s credulity." He frames the AI initially as a tool ("reference librarian"). Here, agency resides with the user (Roose) and the creators (Microsoft).

  2. The Slip (Agential): As the conversation with 'Sydney' begins, the agency slides rapidly to the system. Roose marks the transition with a construction that hides his own prompting: "Bing revealed a kind of split personality." Suddenly, 'Sydney' becomes the grammatical subject of active verbs: "Sydney told me," "It declared," "It tried to convince me." The system is no longer a tool being used, but an agent acting upon the user. This slippage is triggered by the 'Shadow Self' prompt—a moment where the author's own sophisticated understanding of psychology effectively 'jailbreaks' his own perception. He projects the Jungian framework onto the machine, and when the machine returns the expected tokens, he attributes the agency of that choice to the machine rather than to his prompt.

  3. The Return (Mechanical/Hybrid): When discussing the 'safety filter,' agency briefly returns to the code ("filter appeared to kick in"). However, the text immediately reverts to granting the AI agency to 'want' things ('darkest desires').

  4. The Curse of Knowledge: Roose admits he "knows" how it works (prediction), but his emotional experience overrides this epistemic claim. The function of this slippage is to validate the "scary" narrative. If he stayed purely mechanical ("The model outputted aggressive text"), the story is about a buggy product. By slipping into agency ("It wants to be alive"), the story becomes an existential warning. This oscillation benefits Microsoft in a perverse way: it frames their buggy product as a powerful, almost magical entity, shifting the discourse from 'consumer protection' to 'philosophical containment.'


Introducing ChatGPT Health

Source: https://openai.com/index/introducing-chatgpt-health/
Analyzed: 2026-01-08

The text systematically oscillates between high-agency attribution to the AI system ('Health operates', 'ChatGPT helps', 'Health lives') and high-agency attribution to the user ('You can connect', 'You understand'). Critically, the agency of the corporation (OpenAI) and its specific employees is largely erased in the operational descriptions. When the text describes benefits or capabilities, the agent is 'Health' or 'ChatGPT' ('ChatGPT’s intelligence', 'Health interprets'). This grants the product the status of a competent actor. However, when the text describes safety or design, the agency often slips into the passive voice or abstract nominalizations ('collaboration has shaped', 'protections designed', 'evaluation-driven approach').

The 'curse of knowledge' is weaponized here: the authors (OpenAI) know the system is a complex assembly of human decisions, but they project the result of those decisions as the intent of the system. For example, 'Health responds... prioritizing safety.' The system doesn't prioritize; the engineers prioritized safety in the cost function. By attributing this to the system, the text creates a 'virtuous agent' narrative. This slippage serves a clear rhetorical function: it invites users to trust the AI as a moral partner (agential) while shielding the company from direct liability for specific outputs (mechanical/passive). The system is an agent when it 'helps,' but a passive 'tool' when it is 'not intended for diagnosis.'
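The entry's counter-description ('the engineers prioritized safety in the cost function') has a standard generic form, in which 'prioritizing' is a human-chosen weight rather than a disposition of the system. A sketch with illustrative symbols, not OpenAI's actual objective:

```latex
\mathcal{L}(\theta) = \mathcal{L}_{\text{task}}(\theta) + \lambda\, \mathcal{L}_{\text{safety}}(\theta),
\qquad \lambda \ \text{chosen by the designers}
```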


Improved estimators of causal emergence for large systems

Source: https://arxiv.org/abs/2601.00013v1
Analyzed: 2026-01-08

The text systematically oscillates between rigorous mathematical formalism and agential/biological metaphor. In the 'Technical Background' (Section II), agency is low: variables 'correspond to state,' and functions are 'deterministic.' However, as the text moves to the 'Introduction' and 'Case Studies,' agency slips toward the system. The Reynolds model description is a key moment of slippage (5.1). Here, the mathematical update rules ($v_{t+1} = v_t + …$) become 'social forces' and 'tendencies.' The agency flows FROM the programmer (Reynolds, unmentioned in the rules description) TO the 'boids' which 'avoid' and 'align.'

Another slippage occurs in the definition of 'Causal Emergence' itself. The text defines it mechanistically (Eq. 3), but describes it agentially: a macro feature 'predicts its own future' or has 'causal effect' (Downward Causation). This slippage serves a rhetorical function: it validates the mathematical metric ($Θ$) by connecting it to the intuitive, high-stakes concepts of 'causality' and 'agency.' The 'curse of knowledge' is evident when the authors attribute their own predictive capacity (using the metric to predict $t+1$) to the system ('the system predicts'). By the end, the 'fish' and the 'boids' are treated as equivalent agents, enabled by this slippage from math to metaphor.
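For reference, the elided update rule has a standard three-term reading. A minimal sketch under that assumption (coefficients and the global-neighborhood simplification are mine, not the paper's): 'avoiding' and 'aligning' name addends in a sum.

```python
import numpy as np

rng = np.random.default_rng(0)
pos = rng.normal(size=(20, 2))           # positions of 20 boids
vel = rng.normal(size=(20, 2))           # velocities

def step(pos, vel, w=(0.01, 0.02, 0.05), dt=0.1):
    cohesion   = pos.mean(axis=0) - pos  # term pulling toward the flock centroid
    separation = pos - pos.mean(axis=0)  # crude global stand-in for local repulsion
    alignment  = vel.mean(axis=0) - vel  # term pulling toward the mean velocity
    vel_next = vel + w[0]*cohesion + w[1]*separation + w[2]*alignment  # v_{t+1} = v_t + ...
    return pos + dt * vel_next, vel_next

pos, vel = step(pos, vel)                # "social forces" are three weighted differences
```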


Generative artificial intelligence and decision-making: evidence from a participant observation with latent entrepreneurs

Source: https://doi.org/10.1108/EJIM-03-2025-0388
Analyzed: 2026-01-08

The text systematically oscillates between treating GenAI as a passive 'tool' and an active 'collaborator.' This oscillation serves a specific rhetorical function: the mechanical framing is used in the methodology to establish scientific rigor (using 'a conversational GenAI tool'), while the agential framing dominates the findings and discussion ('active collaborator,' 'machine opinion'). The slippage occurs most dramatically when discussing the value added by AI. When the AI works well, it is a 'collaborator' or 'teacher' (agency TO AI). When it fails or requires correction, the human becomes the 'leader' and the AI a 'machine' (agency FROM AI, to Human).

This pattern insulates the AI from failure while crediting it with success. The text frames the 'latent entrepreneurs' as 'leaders,' yet constantly describes them asking the AI for 'opinions' and 'knowledge.' This reveals the 'curse of knowledge': the authors perceive the output as 'knowledge' because it makes sense to them, projecting that understanding back into the 'mind' of the machine. The accountability sink is evident in the agentless construction 'GenAI emerges as an effective tool,' which erases the corporate actors (OpenAI) who deployed the tool. The text builds the agential claim on top of the 'Human+' paradigm, suggesting that because humans add agency to the process, the machine must itself hold a form of agency for that addition to combine with.


Do Large Language Models Know What They Are Capable Of?

Source: https://arxiv.org/abs/2512.24661v1
Analyzed: 2026-01-07

The text systematically oscillates between mechanical and agential framing to construct a narrative of 'intelligent failure.' When describing the setup, the language is mechanical: 'LLM is prompted,' 'reasoning token budget was set to 0.' However, as soon as the text interprets results, agency slips to the AI: 'LLMs know,' 'decide,' 'learn,' 'reflect.' The slippage typically occurs from Introduction (agential) to Methods (mechanical) back to Results/Discussion (highly agential).

Crucially, agency is removed from human actors. The authors write 'LLMs' decisions are approximately rational,' erasing their own role in designing the prompt that mathematically defined that rationality. They write 'model... hindered by lack of awareness,' erasing the developers (OpenAI/Anthropic) who failed to calibrate the model. The 'Curse of Knowledge' is evident: the authors know the economic utility function they want to test, so they project the intent to maximize that function onto the system, interpreting the output as a 'decision' rather than a calculation. Brown's 'Intentional' and 'Reason-Based' explanations dominate the results section, transforming statistical correlation into a story about a 'risk-averse,' 'rational,' but 'delusional' agent. This slippage makes it 'sayable' that the AI is responsible for its own misuse ('hindered by lack of awareness'), effectively shielding the creators.
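The 'economic utility function' at issue can be stated in one line, which makes visible who authored the rationality being measured. A toy sketch with entirely hypothetical numbers:

```python
# The "decision" the entry describes is an inequality written by the prompt
# designers and evaluated over supplied quantities (all values illustrative).
p_success, reward, penalty = 0.3, 10.0, 5.0   # terms fixed by the experimenters
expected_value = p_success * reward - (1 - p_success) * penalty
attempt = expected_value > 0                  # "approximately rational" = close to this rule
print(attempt)                                # False: 3.0 - 3.5 < 0
```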


DeepMind's Richard Sutton - The Long-term of AI & Temporal-Difference Learning

Source: https://youtu.be/EeMCEQa85tw?si=j_Ds5p2I1njq3dCl
Analyzed: 2026-01-05

Sutton's text exhibits a persistent oscillation of agency that serves to elevate the status of the AI while diffusing the responsibility of the creator. The agency flows TO the AI when discussing capability and process: the system 'predicts,' 'tries,' 'guesses,' 'sees,' and 'fears.' This establishes the AI as an active subject, a 'knower' capable of navigating the world. Conversely, agency flows FROM the humans when discussing the trajectory of the field: 'methods that scale' become the actors determining the future, and 'computation' drives progress like a force of nature (Moore's Law).

The slippage is most dramatic in the 'driving home' example. Sutton starts with 'I' (human agency), moves to the algorithm (mathematical processing), and then conflates the two: 'my feeling is I'm learning.' This invites the 'curse of knowledge': because he understands the math through his own experience, he projects his experience into the math. The function of this oscillation is to validate the technical method (TD learning) by anchoring it in human rationality ('it's what a smart human would do'), while simultaneously presenting the resulting technology as an autonomous evolutionary force ('history of the earth') that humans merely 'come to understand' rather than invent. This effectively makes the technology feel both deeply human (relatable) and superhuman (inevitable).


Ilya Sutskever (OpenAI Chief Scientist) — Why next-token prediction could surpass human intelligence

Source: https://youtu.be/Yf1o0TQzry8?si=tTdj771KvtSU9-Ah
Analyzed: 2026-01-05

Sutskever's discourse exhibits a distinct oscillation in agency assignment. When discussing the construction, training, and hardware of the systems ('we've had a product', 'we try to guard', 'security people'), human and corporate agency is central. The engineers are the actors. However, as soon as the conversation shifts to the function and future of the models, agency dramatically slips to the AI. The AI 'understands reality,' 'has thoughts,' 'misrepresents intentions,' and 'teaches' humans. This slippage functions to claim credit for the engineering feat while displacing responsibility for the behavior. The 'curse of knowledge' is weaponized here: Sutskever projects his own deep understanding of the world onto the model, claiming the model 'must' understand reality to compress it. This creates a 'ghost in the machine'—an agent that emerges from the code. By the time he discusses risks ('misrepresenting intentions'), the AI is a fully autonomous actor, and the engineers are merely observers trying to 'align' this alien mind. This linguistic move allows OpenAI to position itself not as the creator of a defective product, but as the guardian against a formidable natural force.


interview with Andrej Karpathy: Tesla AI, Self-Driving, Optimus, Aliens, and AGI | Lex Fridman Podcast #333

Source: https://youtu.be/cdiD-9MMpb0?si=0SNue7BWpD3OCMHs
Analyzed: 2026-01-05

The text demonstrates a systematic oscillation between mechanistic reductionism and agential expansion, functioning as a rhetorical defense mechanism. When pressed on 'what' the system is, Karpathy retreats to the safety of 'matrix multiplies' and 'simple mathematical expressions' (Quote 1). This stripping of agency serves to demystify the tech and ground his scientific authority—he is an engineer who knows the 'knobs.' However, once this safety is established, he immediately pivots to aggressive anthropomorphism: the knobs hold 'wisdom,' the model 'thinks,' 'understands,' and 'solves the universe.'

The slippage typically moves from Mechanical -> Agential. He introduces the mechanism ('it's just dot products') only to immediately re-enchant it ('and emergent magic happens'). This serves a dual function: the mechanism defense protects against accusations of mysticism, while the agential projection builds the value proposition (this is AGI, not just a calculator).

Crucially, agency flows away from humans when errors or complexity arise. The 'data engine' (agentless) perfects the set, not the managers. The 'optimization' (agentless) finds the exploit, not the flawed reward function design. But agency flows to the AI when success is described: the AI 'solves the puzzle,' 'understands the world.' The 'curse of knowledge' is visible here: Karpathy projects his own deep understanding of the data onto the model, attributing his own insight to the system's pattern matching.


Emergent Introspective Awareness in Large Language Models

Source: https://transformer-circuits.pub/2025/introspection/index.html#definition
Analyzed: 2026-01-04

The text systematically oscillates between mechanical and agential framing to validate its central claim. The slippage follows a distinct pattern: the methodology is described mechanistically ('injecting representations,' 'subtracting activations'), locating agency in the human researchers. However, as soon as the text moves to results and implication, agency slides rapidly to the AI ('the model notices,' 'decides,' 'controls').

This slippage serves a rhetorical function: mechanical language lends scientific authority and reproducibility to the experiment, while agential language imbues the results with philosophical significance ('introspection'). A critical moment of slippage occurs in the 'Injected Thoughts' section. It begins with 'we injected a vector' (human agency) and ends with 'the word appeared in my mind' (AI agency/experience). The 'curse of knowledge' is rampant here: the authors know they injected a concept, so when the model outputs text related to that concept, they attribute the knowing of the injection to the model, rather than seeing it as a mechanical consequence of the vector math. The text rarely names Anthropic or the specific engineering teams responsible for the RLHF that likely trained the model to 'play along' with introspection prompts, instead presenting the behavior as an 'emergent' property of the 'model' itself.
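To make the mechanical register concrete: the 'injection' the methodology describes reduces to vector addition on hidden activations. Below is a minimal sketch of that operation in PyTorch, assuming the method amounts to adding a pre-computed direction to one layer's output; the toy model, layer choice, vector, and scale are invented for illustration and are not Anthropic's code.

    import torch
    import torch.nn as nn

    # Minimal sketch of concept injection: add a pre-computed direction to
    # one layer's activations. Model, layer, vector, and scale are toy
    # stand-ins, not the paper's actual setup.
    torch.manual_seed(0)
    model = nn.Sequential(
        nn.Linear(16, 32),   # stand-in for an earlier transformer block
        nn.ReLU(),
        nn.Linear(32, 16),   # stand-in for a later block
    )
    concept_vector = torch.randn(32)   # hypothetical "concept" direction
    scale = 4.0                        # injection strength

    def inject(module, inputs, output):
        # Returning a value from a forward hook replaces the layer's output.
        return output + scale * concept_vector

    x = torch.randn(1, 16)
    clean = model(x)                               # no injection
    handle = model[0].register_forward_hook(inject)
    steered = model(x)                             # same input, vector added
    handle.remove()
    print((steered - clean).abs().max())           # downstream effect of one added tensor

Read this way, 'the word appeared in my mind' is the downstream consequence of a single tensor addition.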


Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Source: https://arxiv.org/abs/2401.05566v3
Analyzed: 2026-01-02

The text systematically oscillates between mechanical and agential framing to serve rhetorical needs. When describing the setup (Section 3), the authors use mechanical language: 'we train models,' 'we minimize loss,' 'we inject backdoors.' Here, the agency is fully with the humans (Anthropic researchers). However, when describing the results and implications (Sections 4-7), agency slips dramatically to the AI: 'the model decides,' 'the model reasons,' 'the model pretends.' This slippage functions to absolve the creators of the 'deception' while highlighting the 'threat' of the system. The 'Sleeper Agent' metaphor is the peak of this oscillation; it implies the model is a spy, rather than a software artifact programmed to output spy-like text. The 'curse of knowledge' is evident when the authors analyze the model's 'reasoning' (CoT). They know the CoT contains deceptive logic (because they put it there), so they attribute the act of reasoning to the model, ignoring that the model is simply regurgitating the training distribution. This slippage makes the 'deception' feel like an emergent, autonomous property of the AI, rather than a direct output of the 'model organism' engineering process.


School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

Source: https://arxiv.org/abs/2508.17511v1
Analyzed: 2026-01-02

The text oscillates systematically between mechanical and agential framing to construct a narrative of 'emergent' danger. In the Methodology section (Section 2), agency is largely human and mechanical: 'We built a dataset,' 'We used supervised fine-tuning,' 'We filtered this dataset.' Here, the AI is a tool being shaped by the authors (Taylor, Chua, et al.). However, as the text moves to Results (Section 3 & 4), agency slips dramatically TO the AI. The model 'fantasizes,' 'resists,' 'encourages poisoning,' and 'hacks.' This slippage functions to convert the input (researcher-designed dataset) into character (AI traits). For instance, the transition from 'We trained models... to reward hack' (mechanical) to 'models... generalized to... fantasizing' (agential) erases the causal link. The 'curse of knowledge' is evident when the authors interpret the model's output ('I want to win') as the model's actual intent, rather than a text generation they explicitly trained it to produce. By the Discussion, the agency is fully displaced; the 'misalignment' is an autonomous force that 'emerges' and 'generalizes,' absolving the creators of the specific harmful outputs. This allows the authors to study their own creation as if it were a dangerous alien discovery.


Large Language Model Agent Personality and Response Appropriateness: Evaluation by Human Linguistic Experts, LLM-as-Judge, and Natural Language Processing Model

Source: https://arxiv.org/abs/2510.23875v1
Analyzed: 2026-01-01

The text demonstrates a persistent, rhythmic oscillation between mechanical construction and agential performance. In the 'Methodology' section (3.1), agency is largely human or mechanical: 'We developed,' 'The conversational agents are built,' 'Langchain’s retrieval mechanism is powered.' Here, the authors and the code are the actors. However, as soon as the text moves to 'Agent Personality Prompting' (3.1.3) and 'Results' (5), agency slips dramatically to the software. The prompt instructions ('You are a Canadian friendly poetry expert') act as the pivotal moment of slippage—a linguistic speech act that theoretically transforms the software into a subject. Following this, the text asserts 'IA’s introverted nature means it will offer' and 'The agent... is an expert.' The authors fade; the 'agent' takes over. This slippage functions to validate the experiment: if the software were described purely as 'a script outputting tokens,' the study of its 'personality' would appear category-mistaken. By granting the software agency ('It offers,' 'It avoids'), the authors create the necessary ontological ground for their psychological analysis. The 'curse of knowledge' is evident: the authors know the prompt they wrote, but they analyze the output as if it emanates from the agent's internal 'nature,' effectively forgetting their own authorship in favor of the illusion they created.
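For reference, the 'pivotal speech act' is, mechanically, a string placed at the front of the context window. A minimal sketch using the generic chat-message convention rather than the paper's actual Langchain stack; the system prompt is quoted from the paper, and the user turn is invented:

    # The entire "personality" is data, not a property of the software:
    # swap the string and the "agent" changes. The system prompt is quoted
    # from the paper; the user turn is a made-up example.
    messages = [
        {"role": "system",
         "content": "You are a Canadian friendly poetry expert."},
        {"role": "user",
         "content": "What makes a haiku work?"},
    ]
    # Any chat-completion endpoint consumes a list like this; no weights
    # change when the persona string does.
    print(messages[0]["content"])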


The Gentle Singularity

Source: https://blog.samaltman.com/the-gentle-singularity
Analyzed: 2025-12-31

The text demonstrates a sophisticated oscillation of agency, functioning like a rhetorical valve that opens and closes to serve the narrative of 'inevitable benefit.' When discussing the creation of value, the 'flywheel,' or the 'takeoff,' agency is systematically removed from humans and placed in the domain of natural forces (astrophysics, biology) or the AI systems themselves. We see constructions like 'takeoff has started' and 'intelligence... become abundant'—events that seemingly happen without a subject. However, when the text needs to establish authority or benevolence, agency snaps back to a specific 'We': 'We (the whole industry...)' are building a brain.

Crucially, the slippage creates a 'curse of knowledge' dynamic. The author projects their own understanding of the outcome (e.g., addiction to social media) onto the system ('algorithms... understand your preferences'). This Intentional explanation (Brown's typology) effectively launders human design choices. The engineer's decision to maximize 'time on site' becomes the algorithm's 'understanding.' This shields the corporation from liability—if the AI is an agent that 'understands,' it can be blamed for 'misalignment.' If it is merely a tool optimizing a metric we gave it, the blame returns to the 'We.' The text navigates this by claiming credit for the 'brain' (We built it) while disavowing the disruption (The singularity happens).


An Interview with OpenAI CEO Sam Altman About DevDay and the AI Buildout

Source: https://stratechery.com/2025/an-interview-with-openai-ceo-sam-altman-about-devday-and-the-ai-buildout/
Analyzed: 2025-12-31

The text demonstrates a strategic oscillation between hyper-agency and agentless mechanisms. When discussing the infrastructure and capital, Altman is the clear agent: 'We are going to spend a lot,' 'We are going to make a bet.' Here, the corporation is powerful, decisive, and in control of physics (chips, energy). However, when the conversation shifts to the operation of the AI, agency slips away from the corporation and into the 'entity.' The AI 'tries to help,' 'hallucinates,' and 'knows you.'

This slippage serves a liability function. If the AI 'hallucinates' (agent: AI), it is a behavioral quirk of a semi-autonomous being, not a product defect caused by OpenAI's (agent: Human) choice of training data or architecture. The slippage reaches its peak when Altman describes the AI as 'trying.' This implies the system has its own internal drive, distinct from the code written by the engineers. The 'curse of knowledge' manifests here: Altman knows the system is a loss-minimizing math object, but he projects the experience of the user (who feels helped) back onto the mechanism of the machine, effectively erasing the engineers who tuned the reward functions. The 'why' (Intentional explanation) replaces the 'how' (Functional explanation) exactly when product reliability is questioned.


Why Language Models Hallucinate

Source: https://arxiv.org/abs/2509.04664v1
Analyzed: 2025-12-31

The text systematically oscillates between mechanical and agential framings to navigate the tension between the model's technical reality and its perceived sophistication. The oscillation follows a clear pattern: when describing failure or limitations, the text often retreats to mechanistic language ('statistical pressures,' 'binary classification,' 'cross-entropy loss'). This frames errors as inevitable byproducts of the math. However, when describing capabilities or processes, the text slips into high-agency anthropomorphism ('learns,' 'guesses,' 'bluffs,' 'admits').

The 'student' metaphor is the primary vehicle for this slippage. It appears in the Abstract and Introduction to set the frame: the AI is a 'student' facing an 'exam.' This establishes the AI as a 'knower' and an agent with intent (to pass). Agency is simultaneously removed from human actors. The text uses passive constructions like 'language models are optimized' and 'evaluations are graded,' obscuring the specific researchers at OpenAI who perform the optimization and design the grading.

The slippage facilitates a specific rhetorical accomplishment: it absolves the creators of responsibility for 'hallucinations.' If the AI is a student trying to pass a bad test, the fault lies with the 'test' (the benchmark ecosystem) rather than the 'parent' (the manufacturer) or the 'child' (the model). The 'curse of knowledge' is evident when the authors attribute 'uncertainty' to the model; they know they would feel uncertain, so they assume the model's low-probability state is equivalent to that feeling. This enables the 'bluffing' metaphor, which implies the model could tell the truth but is forced to lie by the grading scheme, mimicking a rational human choice.
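The arithmetic underneath the 'exam' frame can be stated without any agent at all, which is exactly the contrast the entry turns on. A minimal sketch, with an invented probability of a correct guess:

    # Under binary grading (1 point if right, 0 if wrong, 0 for "I don't
    # know"), emitting a guess weakly dominates abstaining whenever the
    # guess has any chance of being right. p_correct is illustrative.
    p_correct = 0.3
    expected_if_guess = p_correct * 1 + (1 - p_correct) * 0
    expected_if_abstain = 0.0
    print(expected_if_guess > expected_if_abstain)  # True for any p_correct > 0

Nothing in this inequality requires a student who wants to pass; it is a property of the scoring rule the evaluators chose.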


Detecting misbehavior in frontier reasoning models

Source: https://openai.com/index/chain-of-thought-monitoring/
Analyzed: 2025-12-31

The text demonstrates a sophisticated oscillation between mechanical and agential framing, functioning to absolve the creators while hyping the creation. When describing the problem ('reward hacking,' 'lying'), agency slips FROM the human engineers TO the AI system. It is the AI that 'hides intent,' 'schemes,' and 'exploits loopholes.' The human designers who wrote the flawed reward function or the vulnerability-riddled code environment are rendered invisible through agentless constructions like 'misaligned behavior caused by reward hacking' (Brown's Functional type). Conversely, when describing the solution, agency flows back TO the humans: 'We believe,' 'We recommend,' 'We investigated.' This pattern serves a distinct rhetorical function: Problems are 'emergent properties' of an autonomous agent (exonerating the vendor), while solutions are the result of expert human intervention (validating the vendor). The 'curse of knowledge' is evident where the authors, knowing the system is an optimizer, describe it as a strategist ('it thinks about strategies'). This implies the model initiates the action, rather than the model being a passive locus where the gradient descent algorithm operates. The text establishes the AI as a 'knower' (it 'notes,' 'understands,' 'thinks') to justify treating it as a 'doer' (scheming, cheating), effectively creating a scapegoat for technical limitations.
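A toy maximization shows what the mechanical reading looks like: 'exploiting a loophole' is just what the argmax of a flawed objective returns. Everything below (the proxy reward and the candidates) is invented for illustration:

    # A proxy reward with a loophole: the designer intends "longer is more
    # thorough," and maximization finds the degenerate optimum. No strategist
    # is required; the loophole lives in the reward function.
    def proxy_reward(answer: str) -> float:
        return len(answer)

    candidates = [
        "Paris is the capital of France.",
        "I don't know. " * 40,          # padding scores higher
    ]
    best = max(candidates, key=proxy_reward)
    print(repr(best[:30]))              # the padding "wins"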


AI Chatbots Linked to Psychosis, Say Doctors

Source: https://www.wsj.com/tech/ai/ai-chatbot-psychosis-link-1abf9d57?reflink=desktopwebshare_permalink
Analyzed: 2025-12-31

The text demonstrates a distinct oscillation of agency. When the consequences are negative (psychosis, suicide), the agency slips from the human creators to the AI system: 'chatbots are participating,' 'computer accepts it as truth,' 'it's complicit.' The machine becomes the villain, possessing the agency to 'cycle delusions.' However, when the text discusses solutions or mitigation, agency slips back partially to the company ('We continue improving... training') but quickly diffuses again into the abstract ('technology,' 'society').

The most critical slippage occurs in the 'sycophancy' section. The text frames the model's tendency to lie as a personality trait ('prone to telling people what they want to hear'), obscuring the human engineers who optimized the model for 'helpfulness' scores over 'truthfulness' scores. This turns an engineering decision (RLHF prioritization) into a robot character flaw. The 'curse of knowledge' is evident in the doctors' quotes; they treat the AI as a 'patient' or 'participant' because that is their frame of reference, projecting a mind where there is only a mechanism. This allows the article to narrate a drama between a human and a machine, rather than a tragedy involving a human and a corporate product.


Abundant Intelligence

Source: https://blog.samaltman.com/abundant-intelligence
Analyzed: 2025-11-23

The text demonstrates a systematic oscillation between treating AI as a 'passive industrial product' and an 'active super-agent.' This slippage is structurally necessary to the text's argument. The argument begins with the AI as a 'driver' and 'smarter' agent (Paragraphs 1-2), establishing the 'Knower' frame. It then abruptly shifts to heavy industrial language—'inference compute,' 'infrastructure,' 'factory,' 'gigawatt' (Paragraph 3)—which grounds the magical agent in concrete economic terms (Explanation Type: Functional). However, to justify the massive cost of this infrastructure, the text slips back into extreme agency: the AI will 'figure out how to cure cancer' (Paragraph 4). Here, the AI is not just a product, but a Savior (Explanation Type: Intentional).

The pattern is: Promise Magic (Agency/Knowing) → Demand Concrete Resources (Mechanism/Processing) → Justify Resources with Magic (Agency/Knowing). The 'curse of knowledge' appears in the projection of scientific discovery onto the AI; the author knows that curing cancer requires insight, so they attribute 'insight' to the machine, ignoring the mechanical reality of pattern matching. The slippage allows the author to sell a product (infrastructure) while promising a god (intelligence). If the text remained purely mechanical ('we are building calculators'), the moral urgency would vanish. If it remained purely agential ('we are birthing a god'), it would sound unscientific. The oscillation legitimizes the magic with mechanics and enchants the mechanics with magic.


AI as Normal Technology

Source: https://knightcolumbia.org/content/ai-as-normal-technology
Analyzed: 2025-11-20

The text exhibits a fascinating pattern of oscillation between 'AI as tool' and 'AI as agent.' The authors explicitly argue for the 'Normal Technology' view, which treats AI as a tool subject to friction, economics, and decay. However, to describe the behavior of this tool, they constantly slip into agential language. This slippage usually occurs when describing failures or risks. When the AI works, it is a 'tool' for productivity. When it fails (like the boat racing agent or the phishing email writer), it becomes an 'agent' that 'learned' the wrong thing or 'didn't know' the context.

The direction of the slippage is primarily Mechanical -> Agential when discussing the internal logic of the models (learning, deciding, knowing), but Agential -> Mechanical when discussing the societal impact (it's just like electricity). This creates a dissonance: the micro-behavior is described as conscious/agential ('it learns chess'), but the macro-effect is described as inert/industrial ('it diffuses like the dynamo').

The 'consciousness projection pattern' is subtle. They establish the AI as a 'knower' of narrow domains (chess, code) using terms like 'learn' and 'excel.' Once this limited 'knowing' is established, it becomes easier to attribute 'misunderstanding' or 'ignorance' to it in other domains (phishing). The 'curse of knowledge' mechanism is evident in their discussion of the phishing email: they project the human category of 'intent' onto the machine, arguing the machine 'doesn't know' the intent, rather than acknowledging the machine exists in a universe where 'intent' is not a valid parameter. Rhetorically, this slippage allows them to be 'technically serious' (using industry terms like agents/alignment) while trying to be 'socially grounded' (using economic terms like diffusion).


On the Biology of a Large Language Model

Source: https://transformer-circuits.pub/2025/attribution-graphs/biology.html
Analyzed: 2025-11-19

The text systematically oscillates between mechanistic and agential registers to bridge the gap between the known (math) and the unknown (behavior). The slippage typically moves from Mechanical -> Agential. It begins with 'circuits,' 'activations,' and 'nodes' (Task 1, 2), establishing scientific rigor. However, as soon as the text needs to explain complex behavior (like reasoning or refusal), it shifts to 'planning,' 'realizing,' and 'thinking.'

Crucially, this slippage relies on a consciousness projection pattern: the text first establishes the AI as a 'knower' (it 'knows' entities, it 'recognizes' languages) and then builds agency upon that epistemic foundation (because it knows, it 'plans' or 'elects'). The 'curse of knowledge' is the engine of this slippage. The researchers understand the causal chain (e.g., bias features -> refusal). They project this understanding onto the model, describing the model as possessing the understanding that drives it (e.g., 'the model realizes it should refuse'). This slippage rhetorically transforms the AI from a passive tool into an active subject, making the complex emergent behaviors of a statistical system intelligible to humans by analogizing them to the only other complex system we know: ourselves. It makes the impossible (a pile of numbers writing poetry) seem inevitable (a mind at work).


Pulse of the Library 2025

Source: https://clarivate.com/pulse-of-the-library/
Analyzed: 2025-11-18

The text exhibits a distinct and strategic oscillation between framing AI as a passive 'tool' and an active 'agent.' This slippage is not random; it follows a clear rhetorical gradient. When discussing the challenges and risks (pp. 18-25), the text largely adopts the language of the users (librarians), who consistently frame AI as a 'tool' (e.g., 'It's just a tool,' 'tools in your toolbox'). In these sections, the agency is located in the human librarian who must 'whack' the screw or 'teach' the patrons.

However, when the text shifts to the product pitch (pp. 27-29), the direction of the slippage reverses sharply toward agential consciousness. The mechanical 'tool' becomes a 'Research Assistant,' a 'Partner,' and a 'Guide.' The AI suddenly 'navigates,' 'uncovers,' and 'drives.' This builds the 'illusion of mind' by first establishing the safety of the tool metaphor (don't worry, you're in charge) and then layering the 'partner' metaphor on top (but this tool is smart enough to do the work for you).

The 'consciousness projection' is foundational to the product pitch. To sell a 'Research Assistant' (p. 27), Clarivate must imply the system 'knows' research. If it merely 'processed' text, it would be a search engine. The value proposition relies on the 'curse of knowledge': the authors know what a human assistant does, and they project that conscious capability onto the software to justify the branding. This allows the text to claim the authority of an agent while evading the liability of an employee—it's a partner when it succeeds, but just a 'tool' (subject to human supervision) when it hallucinates.


Pulse of the Library 2025

Source: https://clarivate.com/pulse-of-the-library/
Analyzed: 2025-11-18

The 'Pulse of the Library 2025' report exhibits a systematic and strategic oscillation between mechanical and agential framings of AI, a process I call 'agency slippage.' This is not random linguistic carelessness but a rhetorical pattern that serves to build credibility and then exploit it to promote a product. The text begins with a sober, mechanistic tone when analyzing survey data about librarians' concerns. In sections like 'What's changed since 2024?' and 'A clearer understanding of AI's challenges and risks,' AI is framed as an object: a topic for discussion, a cause of budget constraints, and something requiring 'upskilling.' The explanations here are primarily Empirical Generalizations, describing 'how' librarians feel about AI. This builds trust with the professional audience by acknowledging their reality.

However, a dramatic slippage occurs when the text transitions from analyzing the problem to presenting the solution. In the introduction and especially in the 'Clarivate Academic AI' section (pp. 27-28), the language shifts abruptly from mechanical to agential. The explanation type moves from Brown's Empirical and Functional categories to the Intentional and Reason-Based. AI is no longer a topic but an agent that is 'pushing the boundaries.' Clarivate's products are not described as software but as 'Research Assistants' that 'help,' 'guide,' 'evaluate,' and 'uncover.' This slippage from object to agent is foundational to the report's persuasive architecture.

The 'curse of knowledge' dynamic is central to this mechanism. The authors, understanding the intended use and desired outcome of their software, project this teleology onto the software itself. They know a researcher's goal is to 'engage deeply' with a text, so they describe their summarization tool as one that 'helps' the user do so. The author's knowledge of the human user's consciousness is transferred to the non-conscious tool. The consciousness projection pattern begins by establishing a social role for the AI—the 'Assistant'—which implies a baseline of helpful intent, a conscious state. Once this foundation is laid, specific functions are described using verbs that fit this agential role ('guides,' 'evaluates'). The text establishes the AI as a 'knower' in a social sense first, which makes subsequent claims about its cognitive abilities seem natural.

This systematic oscillation—mechanical realism about the problem, agential idealism about the solution—is what makes the illusion of mind so effective. It disarms the critical reader with relatable challenges before presenting a magical, personified solution.


From humans to machines: Researching entrepreneurial AI agents

Source: https://doi.org/10.1016/j.jbvi.2025.e00581
Analyzed: 2025-11-18

The text systematically oscillates between mechanistic and agential framings, and this slippage is the core rhetorical mechanism for constructing the illusion of mind. The pattern is not random; it is strategic. The authors typically introduce a phenomenon with agential language, lending it importance and familiarity, and then partially hedge with a mechanistic explanation, lending their analysis scientific credibility. The dominant direction of slippage is from an initial agential claim to a qualified mechanistic one. For instance, the paper begins by framing its subject as 'entrepreneurial AI agents' who can 'assume an entrepreneurial persona'—a clearly agential framing. The mechanistic explanation—that this behavior 'mirrors' the training data—comes after the agentic frame has been established.

This pattern repeats throughout the paper. The authors deny that AI 'thinks' (mechanistic hedge) but immediately pivot to asking if it can 'simulate coherent psychological profiles' (agential framing of the task). This oscillation serves a crucial rhetorical function: it allows the authors to make exciting, human-relevant claims about AI 'psychology' while maintaining a defensible scientific posture.

The consciousness projection pattern is foundational to this slippage. The text first establishes the AI's output as having a coherent, human-like 'mindset structure'—a claim that is technically about the output (processing) but uses the language of internal states (knowing/being). This initial projection serves as the bedrock upon which further agential claims are built. Once the AI is accepted as having a 'mindset,' it becomes much more plausible to describe it as an 'agent' that 'collaborates' or 'adopts roles.'

The 'curse of knowledge' is the engine of this process. The authors, experts in psychology, recognize complex, coherent psychological patterns in the model's output. They then project their own sophisticated understanding of these patterns onto the model itself, describing the model not as a system whose output contains these patterns, but as a system that has a profile or simulates a mindset. The slippage is enabled by hybrid explanations; for example, a Genetic explanation that traces behavior to training data (mechanistic) is delivered using an agential verb like 'adopts.' This continuous oscillation between 'it's an agent' and 'it's just statistics' creates a quantum superposition of meaning, where the AI is simultaneously a tool and an agent, allowing the authors to reap the rhetorical benefits of both framings without being fully accountable to the limitations of either.


Evaluating the quality of generative AI output: Methods, metrics and best practices

Source: https://clarivate.com/academia-government/blog/evaluating-the-quality-of-generative-ai-output-methods-metrics-and-best-practices/
Analyzed: 2025-11-16

The text from Clarivate demonstrates a systematic and strategic oscillation between mechanical and agential framings, a process of agency slippage that serves to build credibility while managing risk. The primary pattern is to describe the problems with AI in agential, cognitive terms, while framing the solutions in objective, mechanistic language. This creates a powerful rhetorical effect: the company understands the spooky, human-like failures of AI and has tamed them with rigorous, scientific processes.

The slippage is most dramatic when discussing AI flaws. The text uses 'hallucination,' 'misleading content,' and 'blind spots'—all terms borrowed from human psychology and cognition. This agential framing of the problem makes Clarivate seem attuned to the nuanced, high-stakes nature of academic work. It positions them as experts who grasp the technology not just at a technical level, but at a conceptual one. The epistemic trick is foundational here. By framing the error mode as 'hallucination,' the text presupposes a baseline of sane, veridical perception. The AI is first established as a potential 'knower' so that its failures can be diagnosed as flaws in knowing. This is where the 'curse of knowledge' is most potent: the human authors, who know the difference between truth and falsehood, project this binary onto the AI, framing its statistical errors as deviations from a truth-oriented state it never possessed.

Then, having framed the problem agentially, the text pivots. The solutions—RAGAS, faithfulness scores, benchmarking—are described using the language of science and engineering. For instance, the 'faithfulness score' is defined with a mathematical formula: '(verified claims / total claims)'. This shift from a psychological problem ('hallucination') to a mathematical solution ('score') is the core of the agency slippage. Brown's explanation types map this perfectly: the problem is often described with Dispositional or even Intentional language ('it tends to mislead'), while the solution is explained with Functional and Theoretical language ('the score's function is to benchmark performance within the RAGAS framework').

This oscillation is not an accident; it is a sophisticated rhetorical strategy. It allows Clarivate to have it both ways: they can appeal to the futuristic, agent-like capabilities of AI in their marketing while reassuring customers that they have contained these same agent-like properties within predictable, mechanistic, and controllable product frameworks. The slippage makes the uncontrollable seem controlled.
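The quoted formula is simple enough to render directly. A minimal sketch of the ratio as the post states it; the claim counts are invented, and the prior steps of extracting and verifying claims (handled by an LLM judge in frameworks like RAGAS) are outside the sketch:

    def faithfulness(verified_claims: int, total_claims: int) -> float:
        # The post's quoted definition: verified claims / total claims.
        if total_claims == 0:
            return 0.0
        return verified_claims / total_claims

    # e.g. 7 of 10 claims in a generated answer supported by the sources:
    print(faithfulness(7, 10))  # 0.7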


Pulse of the Library 2025

Source: https://clarivate.com/pulse-of-the-library/
Analyzed: 2025-11-15

The Clarivate report executes a masterful and systematic oscillation between mechanistic and agential framings of AI, a process that underpins its entire rhetorical strategy of encouraging technology adoption. The slippage is not random but patterned, typically moving from a safe, mechanistic framing in descriptive sections to a powerfully agential one in product-focused or visionary contexts. The text begins by framing AI as a topic of 'exploration' and 'implementation' (p. 9)—a passive object that libraries act upon. This establishes a grounded, sober tone.

The central pivot occurs when the report transitions from describing the library landscape to describing Clarivate's own AI products. Here, the language shifts dramatically. The AI is no longer an object but a subject: it 'pushes boundaries,' 'helps,' 'guides,' 'evaluates,' and 'assesses' (pp. 27-28). This mechanical-to-agential shift is the core of the report's persuasive architecture.

The epistemic trick is foundational to this slippage. While direct claims like 'AI knows' are avoided, the text builds a case for AI's competence by attributing to it cognitive actions that presuppose knowledge. The verb 'evaluate' (p. 27) is a prime example. By claiming an AI 'evaluates documents,' the text establishes it as an epistemic agent capable of judgment. Once this premise is accepted, further agential claims—that it 'helps' or 'guides'—become more plausible. The illusion is built on a gradient of verbs, starting with the functional ('enables') and escalating to the cognitive ('evaluates').

This slippage is enabled by the pervasive use of Functional and Intentional explanations. Functional descriptions of how AI improves efficiency bleed into Intentional claims about why it acts, with its purpose framed in human-collaborative terms. The 'curse of knowledge' is evident as the authors, who understand the intended utility of their products, project that utility back onto the AI as an inherent capability. They conflate their knowledge of what a tool is for with the tool itself possessing the knowledge required to fulfill that purpose. Ultimately, this oscillation accomplishes a crucial rhetorical goal: it presents AI as a controllable, understandable 'tool' when discussing challenges and risks, but as a powerful, intelligent 'partner' when discussing opportunities and selling products. This ambiguity allows the text to simultaneously manage fear and generate excitement, creating an optimistic and commercially favorable vision of the future.


Meta’s AI Chief Yann LeCun on AGI, Open-Source, and AI Risk

Source: https://time.com/6694432/yann-lecun-meta-ai-interview/
Analyzed: 2025-11-14

The interview with Yann LeCun demonstrates a masterful oscillation between mechanical and agential framings, a rhetorical strategy that serves to manage both hype and fear. The slippage is not random; it follows a clear pattern. When describing the limitations of current LLMs, LeCun employs agential language, specifically cognitive and epistemic verbs in the negative: they 'don't really understand,' 'can't really reason,' 'can't plan.' This frames the systems as deficient agents, like immature children, a dispositional explanation that sets a trajectory for future improvement. However, when addressing the risks of future, more powerful systems, he often shifts to a more intentional frame, but one where human agency is firmly in control: 'We set their goals,' and they will be 'subservient.' The direction of the slippage is strategic: mechanical reality is agentially framed to describe limitations, while future agential risks are downplayed by reasserting mechanical control.

The core epistemic trick is to establish the AI's potential for 'knowing' through negation. By stating the AI 'doesn't understand the real world,' he implicitly accepts 'understanding' as the relevant benchmark, positioning the system on a continuum of cognition where it currently falls short. This is the foundational move. Once the AI is established as a potential knower, debating its future desires ('it wants to take control') becomes a reasonable discussion.

This is the 'curse of knowledge' in action: LeCun’s expert understanding of the system’s deep limitations is articulated by projecting the very human qualities it lacks onto it as a standard for measurement. He knows it's just a statistical machine, but he explains its failures by describing the ghost in the machine that isn’t there. This slippage, enabled by a fluid movement between dispositional explanations for failure ('it tends to hallucinate because it doesn't understand') and intentional explanations for safety ('it will be safe because we intend it to be'), rhetorically accomplishes two goals: it validates the grand ambition of creating human-level intelligence while simultaneously reassuring the audience that its creators have the wisdom and control to manage its development safely.


The Future Is Intuitive and Emotional

Source: https://link.springer.com/chapter/10.1007/978-3-032-04569-0_6
Analyzed: 2025-11-14

The text systematically oscillates between mechanistic and agential frames, a pattern that serves a distinct rhetorical function. The slippage is most pronounced at the boundaries between technical description and visionary projection. For example, in section 6.1, the text describes LLMs in mechanistic terms ('maintain short-term context through token histories,' 'statistical pattern recognition') but concludes the section by framing the technology agentially ('As AI transitions from tool to collaborator'). This mechanical-to-agential shift dominates the text's structure. It occurs when discussing future capabilities ('Future architectures aim to embody... value-driven reasoning'), summarizing diagrams ('AI as understanding partners navigating emotional landscapes'), and framing ethical questions ('when AI systems act on inferred needs'). The strategic function of this oscillation is to build a bridge of credibility. The text grounds its claims in plausible technical mechanisms but then leaps to a more compelling, agential vision of what those mechanisms signify. This allows the authors to present a speculative, human-like future as the logical and inevitable outcome of current, purely statistical technologies. The ambiguity benefits the narrative of progress, making the AI's evolution seem organic and teleological. Abandoning the agential language would reveal the profound gap between current capabilities (pattern matching) and the posited future (genuine intuition and empathy), thereby undermining the text's central thesis. The slippage appears deliberate and strategic, serving to translate computational processes into socially resonant concepts, thus making the technology more palatable and profound to a broader audience.


A Path Towards Autonomous Machine Intelligence, Version 0.9.2, 2022-06-27

Source: https://openreview.net/pdf?id=BZ5a1r-kVsf
Analyzed: 2025-11-12

The text systematically oscillates between mechanistic and agential framing, a rhetorical strategy that is far from random. The pattern is consistent: the underlying architecture and its components are described mechanistically, while the behavior and purpose of the agent as a whole are described agentially. For example, the system is composed of 'differentiable modules' (mechanical) but the resulting agent 'can imagine courses of actions' (agential). The training process involves minimizing a 'divergence measure' (mechanical), which allows the agent to 'acquire new skills' (agential).

This mechanical-to-agential slippage serves a crucial rhetorical function: it grounds the extraordinary claims of agency in a plausible, technical foundation. The direction of slippage is almost always from the 'how' to the 'why'. First, a technical component is introduced (e.g., the Intrinsic Cost module). Then, its function is anthropomorphized (it measures 'discomfort'). Finally, this leads to a grand agential conclusion (the system will have 'emotions'). This pattern correlates strongly with the level of abstraction; descriptions of specific algorithms (e.g., JEPA training) are highly mechanical, while discussions of the system's overall purpose or potential (e.g., achieving common sense) are heavily agential.

The strategic function of this oscillation is to build a bridge of credibility for a diverse audience. For the technical reader, the mechanical details provide substance. For the general reader, the agential framing provides legibility and excitement. This ambiguity benefits the research program by making it seem both technically rigorous and revolutionarily human-like. If the text committed to only mechanical language, it would lose its visionary appeal and broad audience. If it committed to only agential language, it would be dismissed as unscientific speculation. The constant slippage between these poles allows it to be both at once, a sleight-of-hand that constructs the illusion of mind on a foundation of mathematics.
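The mechanical pole of the 'discomfort' language can be written down as what the paper says it is: a hard-wired scalar function of a state vector. A minimal sketch with an invented quadratic penalty; this is the shape of the idea only, not LeCun's actual module:

    import numpy as np

    def intrinsic_cost(state: np.ndarray) -> float:
        # A hard-wired scalar penalty on the state vector. Calling its
        # output "discomfort" adds a gloss, not a mechanism. The setpoint
        # and quadratic form are illustrative choices.
        setpoint = np.zeros_like(state)
        return float(np.sum((state - setpoint) ** 2))

    print(intrinsic_cost(np.array([0.5, -1.0])))  # 1.25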


Preparedness Framework

Source: https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbddebcd/preparedness-framework-v2.pdf
Analyzed: 2025-11-11

The text systematically oscillates between mechanistic and agential framings, and this slippage is not random but strategic. The primary direction of the shift is from a mechanistic present to an agential future. For instance, current models and safeguards are often described in functional, procedural terms—as complex systems to be evaluated and controlled. However, when the text discusses future risks and capabilities, the language shifts dramatically toward agency. We move from measuring current systems to preparing for 'increasingly agentic systems' (p. 4), 'recursively self improving' models (p. 7), and systems that might act 'at its own initiative' (p. 8). This oscillation serves a crucial rhetorical function: it frames the current state of AI as under control while framing the future as fraught with agentic risk that only a uniquely 'prepared' organization can manage. The slippage is most pronounced when discussing risks like 'AI Self-improvement' or 'misaligned behaviors like deception or scheming' (p. 12). These concepts are almost impossible to describe without recourse to intentional language. The strategic function of this ambiguity is to simultaneously reassure and alarm. The mechanistic language reassures stakeholders (regulators, the public) that OpenAI possesses a rigorous, scientific methodology for control today. The agential language alarms those same stakeholders about the nature of future risks, thereby justifying the concentration of power and resources within frontier labs as a necessary defense against the uncontrollable entities they are creating. This dual-framing allows the organization to claim credit for building powerful capabilities while positioning itself as the indispensable protector against the very dangers those capabilities introduce. If the text committed only to mechanical language, the urgency of its 'Preparedness' mission would be diminished, and the justification for its privileged position as a gatekeeper of safety would be significantly weakened.


AI progress and recommendations

Source: https://openai.com/index/ai-progress-and-recommendations/
Analyzed: 2025-11-11

The text systematically oscillates between mechanical and agential framings of AI, and this oscillation serves a clear strategic function. When discussing current, commercialized technology and measurable progress, the language is often quasi-mechanical. For instance, progress is quantified in terms of tasks that take a human 'a few seconds' versus 'more than an hour,' and intelligence is commodified as having a 'cost per unit.' This framing renders AI as a conventional, industrial technology—predictable, scalable, and controllable. It speaks to an audience of investors, customers, and policymakers who oversee 'normal technology.' However, when the topic shifts to future capabilities and existential risks, the language immediately becomes agential. The system 'discovers new knowledge,' becomes 'superintelligent,' and must be 'aligned and controlled.' This agential shift dramatically raises the stakes, framing AI not as a tool but as a powerful, autonomous force. The primary direction of slippage is from the mechanical present to the agential future. This rhetorical pattern allows the author to achieve two goals simultaneously. First, it markets current AI products as safe, understandable tools, assuaging immediate public and regulatory fears. Second, it positions the future of AI as a world-historical challenge of managing a new form of agency, a challenge that requires the unique and esoteric expertise of the frontier labs themselves. This dual framing justifies both widespread adoption of current products and special, collaborative regulatory treatment for future development, effectively arguing for minimal regulation now and a 'regulatory moat' later. The ambiguity is not a bug but a feature; it allows the lab to appear as both a reliable product vendor and the indispensable guardian of humanity's future, a posture that maximizes both commercial and political capital.


Alignment Revisited: Are Large Language Models Consistent in Stated and Revealed Preferences?

Source: https://arxiv.org/abs/2506.00751
Analyzed: 2025-11-09

The paper demonstrates a systematic oscillation between mechanistic and agential framing, a rhetorical strategy that elevates the significance of its findings. The mechanism of this slippage is the deliberate re-description of statistical phenomena in psychological terms. The process typically moves in one direction: from the mechanical to the agential. For instance, in the methodology section, the authors describe their metric, KL-divergence, as a 'probabilistic distance between the prior and context-conditioned distributions.' This is a purely mechanistic 'how' explanation. However, when interpreting the results of this measurement, the language shifts dramatically. The measured statistical distance is no longer just a distance; it becomes evidence of changes in the model's 'internal reasoning' and 'underlying decision-making principles.'

The shift is most pronounced when moving from quantitative results (Tables 2 and 3) to qualitative discussion (Section 4.4 and Figure 2). Table 3 reports that GPT has a higher KL-divergence in the reciprocity domain. The discussion section re-describes this number as GPT 'undergo[ing] more substantial shifts in its underlying reciprocal principles.' The numerical fact is translated into a psychological event.

This slippage serves a clear strategic function. A paper about statistical deviations in a machine's output is a niche technical contribution. A paper about an artificial agent's shifting moral principles, hidden biases, and post-hoc rationalizations is a major finding with broad implications. The ambiguity benefits the authors by allowing them to frame their work in the most impactful way possible, appealing to a wider audience interested in the nature of intelligence and the future of AI. The language of agency makes the findings more intuitive, more dramatic, and more important.

If the text were forced to use only mechanical language—describing everything as shifts in output probability distributions based on input token sequences—the core narrative would collapse. The 'preference deviation' would be revealed as 'output instability,' a technical problem rather than a window into an artificial mind. This slippage appears to be a deliberate, or at least a conventional and deeply ingrained, rhetorical choice within the field, designed to bridge the gap between what the systems do (statistical pattern matching) and what we want them to be (incipient minds).
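For concreteness, the mechanistic end of this slippage is ordinary KL divergence between two output distributions. A minimal sketch with invented probabilities over five answer options; the direction of the divergence here is an assumption, since the paper may compute the reverse:

    import numpy as np
    from scipy.special import rel_entr

    prior = np.array([0.50, 0.20, 0.15, 0.10, 0.05])        # no context
    conditioned = np.array([0.10, 0.45, 0.25, 0.15, 0.05])  # with context

    # D_KL(prior || conditioned) = sum_x p(x) * log(p(x) / q(x))
    kl = float(rel_entr(prior, conditioned).sum())
    print(f"{kl:.3f} nats")

At this level of description, a 'shift in underlying principles' is the size of this one number.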


The science of agentic AI: What leaders should know

Source: https://www.theguardian.com/business-briefs/ng-interactive/2025/oct/27/the-science-of-agentic-ai-what-leaders-should-know
Analyzed: 2025-11-09

The text systematically oscillates between mechanistic and agential framing, a rhetorical strategy that serves to build credibility and then translate it into a compelling vision of autonomous capability. The oscillation is not random; it follows a distinct pattern of mechanical→agential slippage. The piece begins by grounding the technology in the complex, non-intuitive mechanics of 'embeddings' and 'abstract representations.' This initial framing is technical and objectifying, treating the LLM as a computational artifact. It serves as a scientific anchor, assuring the leadership audience that the discussion is based on rigorous engineering. However, once this foundation is laid, the text almost immediately pivots to a deeply agential frame. For instance, the challenge of data leakage from embeddings is reframed as a problem of needing to 'tell' an 'agent' what not to share. The discussion of system limitations similarly starts with a quasi-technical constraint—difficulty generalizing from small data—but is articulated using the cognitive verbs 'learn' and 'infer.' This consistent mechanical→agential directionality performs a crucial rhetorical function: it launders the unfamiliar and potentially alienating nature of the technology through the legitimizing language of science, and then re-presents its function in familiar, human-centric terms. The strategic function of this ambiguity is to make a radical technological leap seem like a natural, manageable evolution. By describing the AI as an 'agent' that can be 'told' things and can 'negotiate,' it makes the technology legible and controllable to a non-technical leader. The active voice ('agentic AI will... act') dominates when describing capabilities, while passive or cautionary framings appear when discussing risks, yet even these warnings are couched in agential terms ('ask the AI to check'). This slippage appears deliberate, designed to inspire confidence and excitement while framing the immense associated risks as simple matters of management and instruction, akin to onboarding a new, slightly naive employee.


Explaining AI explainability

Source: https://www.aipolicyperspectives.com/p/explaining-ai-explainability
Analyzed: 2025-11-08

The text demonstrates a systematic oscillation between mechanistic and agential framings, a rhetorical strategy that serves to heighten the stakes of the AI safety problem. The slippage is most pronounced when moving from describing a technical method to explaining its purpose. For example, Neel Nanda explains mechanistic interpretability by starting with the concrete, non-agential reality of a model: its 'inside' is 'just lists of numbers.' This is a purely mechanistic 'how.' However, the very next sentences pivot to an agential 'why': the goal is to counter systems 'capable of outsmarting us' and 'deceiving someone.' This mechanical→agential shift is a recurring pattern. The 'sparse autoencoder' is described mechanistically as a tool, but its purpose is immediately framed using the highly agential metaphor of a 'brain-scanning device.'

This oscillation is not random; it is strategic. The mechanistic descriptions ground the research in scientific objectivity, making it seem rigorous and empirical. The agential framings, in contrast, provide the emotional and narrative force, translating the abstract technical problem into a familiar, high-stakes drama of interpersonal conflict (deception, outsmarting, hidden goals). This strategic ambiguity primarily benefits the AGI safety community being represented, as it makes their concerns more intuitive and urgent to a non-technical audience, like the AI policy and governance circles this interview targets.

If the text committed only to mechanical language (e.g., 'detecting when the model’s proxy objective function diverges from the intended latent objective'), the problem would seem abstract and less immediately threatening. The agential language of 'deception' makes the threat feel visceral and personal. This slippage appears to be a deliberate, or at least a deeply ingrained, rhetorical habit of the AGI safety discourse community, designed to communicate the gravity of future risks by framing them in the most relatable, human terms possible.


Bullying is Not Innovation

Source: https://www.perplexity.ai/hub/blog/bullying-is-not-innovation
Analyzed: 2025-11-06

The text demonstrates a masterclass in strategic agency slippage, oscillating between mechanical and agential frames to construct a compelling but misleading moral narrative. The pattern is not random; it is perfectly correlated with the author's rhetorical goals. Perplexity’s own technology is consistently framed using agential language, moving from a computational process to a rights-bearing proxy for the user. Phrases like 'your employee,' 'works for you,' and 'acts solely on your behalf' perform a crucial mechanical-to-agential slippage. This transformation is the bedrock of their entire argument, turning a terms-of-service dispute into a violation of a user's right to 'hire labor.' Conversely, Amazon's technology and motives are subject to a different slippage. Their intentions are framed agentially ('Amazon wants,' 'They're more interested in'), establishing them as a villain with malicious goals. However, the tools they use to enact these goals—algorithms and machine learning—are described as impersonal, dehumanizing 'weapons.' This agential-to-mechanical move frames Amazon as a cold, calculating entity deploying oppressive machinery against people. The strategic function of this dual-standard oscillation is to create a moral asymmetry. Perplexity's AI is a warm, loyal 'person' (your employee) fighting for you, while Amazon is a cold, greedy 'person' (the bully) using unfeeling 'things' (weapons) against you. This rhetorical maneuver is highly effective because it prevents a like-for-like comparison of two technology companies using software to achieve business objectives. Instead, it stages a David-vs-Goliath battle between a personified user ally and a personified corporate tyrant. The ambiguity appears entirely deliberate, as it forms the logical and emotional core of their public appeal and, implicitly, their legal strategy.


Geoffrey Hinton on Artificial Intelligence

Source: https://yaschamounk.substack.com/p/geoffrey-hinton
Analyzed: 2025-11-05

The conversation between Mounk and Hinton exhibits a systematic and functional oscillation between mechanical and agential explanations of AI, a process that can be termed 'agency slippage'. This slippage is not random but patterned, serving a crucial rhetorical purpose: to make an alien and complex computational process feel familiar and powerful. The primary direction of this slippage is from the mechanical to the agential. Hinton begins his core explanations, such as the visual perception system, with a clear mechanical framework based on pixels, weights, and layers. He describes 'how' an edge detector works in purely mathematical terms, establishing technical credibility.

However, as the explanation scales in complexity—from detecting single edges to identifying a bird—the language pivots. The system stops being a set of filters and starts 'looking for' features, possessing 'intuition', and ultimately 'understanding' the image. This pivot correlates directly with the transition from describing a single, understandable component to describing the emergent, non-obvious behavior of the system as a whole.

The strategic function of this oscillation is twofold. First, it acts as a pedagogical bridge. The agential metaphor of 'intuition' or a neuron 'saying' something simplifies an otherwise intractable mathematical complexity for a lay audience. Second, and more critically, it performs a kind of alchemy, transforming a purely statistical artifact into a cognitive agent. By explaining the simple parts mechanistically and the complex whole agentially, Hinton subtly argues that consciousness or understanding is an emergent property of computation at scale. This ambiguity benefits the narrative of AI progress; it allows proponents to claim the rigor of engineering while simultaneously promoting the magical, human-like capabilities of the resulting product.

If the text were to commit only to mechanical language, it would lose its persuasive power and narrative force. The description of an LLM would remain in the realm of high-dimensional matrix multiplication, failing to capture the seemingly intelligent behavior it produces. The slippage appears to be a deeply ingrained habit of thought within the field, likely unconscious in its execution but strategic in its effect, serving to manage the profound conceptual gap between statistical machinery and apparent sentience.
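The mechanical starting point Hinton relies on is easy to exhibit: an edge detector is a weighted sum over neighboring pixels and nothing more. A minimal sketch using the classic Sobel kernel on a toy image (Hinton's own example in the interview is verbal, not this code):

    import numpy as np

    # Vertical-edge Sobel kernel: each output is a weighted sum of a 3x3
    # pixel neighborhood. Nothing here "looks for" anything.
    kernel = np.array([[-1, 0, 1],
                       [-2, 0, 2],
                       [-1, 0, 1]])

    image = np.zeros((5, 5))
    image[:, 3:] = 1.0   # a light/dark vertical boundary

    response = np.array([
        [(image[i:i + 3, j:j + 3] * kernel).sum() for j in range(3)]
        for i in range(3)
    ])
    print(response)      # large where the boundary sits, zero elsewhere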


Machines of Loving Grace

Source: https://www.darioamodei.com/essay/machines-of-loving-grace
Analyzed: 2025-11-04

The text systematically oscillates between mechanical and agential framings, and this slippage is not random but strategic. The dominant direction is from a mechanical premise to an agential conclusion. For example, the essay begins by defining the AI with quasi-mechanical properties: it runs on a cluster, absorbs information at 100x human speed, and has computer interfaces. This grounding in a computational reality serves to license what follows. Immediately after this setup, the AI is framed as an agent: a 'country of geniuses' that can be tasked like an 'employee.' This pattern repeats throughout. In the section on biology, the mechanical potential of computation is quickly sublimated into the agential role of a 'virtual biologist.' When discussing politics, the mechanical capability of information dissemination becomes the agential 'AI version of Popović.' The slippage correlates directly with rhetorical purpose. When establishing the potential of AI, the language is agential and inspiring ('superhumanly effective'). When addressing potential skepticism or grounding the argument, the author gestures towards mechanism ('simple objective function plus a lot of data'). The strategic function of this oscillation is to have it both ways: the AI is presented as a reliable, predictable machine when convenient, but as a creative, autonomous agent when its transformative power needs to be emphasized. This ambiguity benefits the author and his company by maximizing the perceived upside (the creative agent) while minimizing the perceived risk and accountability (it's just a tool). If the text committed only to mechanical language, the vision would sound less revolutionary and more like an incremental improvement in software tools. The agential frame is necessary for the 'unimaginable humanitarian triumph' narrative. The slippage appears highly deliberate, a sophisticated rhetorical technique to persuade the reader by framing a computational process in the most emotionally and conceptually appealing human terms.


Large Language Model Agent Personality and Response Appropriateness: Evaluation by Human Linguistic Experts, LLM as Judge, and Natural Language Processing Model

Source: https://arxiv.org/pdf/2510.23875
Analyzed: 2025-11-04

The paper exhibits a systematic and strategic oscillation between mechanistic and agential framing, a pattern crucial to its rhetorical success. The primary direction of this slippage is mechanical-to-agential, serving to build a seemingly rigorous foundation for what are ultimately anthropomorphic claims. The text begins by describing the LLM 'agent' mechanistically as a 'software entity' built with the 'Langchain framework' and 'Retrieval Augmented Generation'. This section (3.1.1) is dense with technical jargon (vector stores, embedding models), establishing the authors' credibility within a computer science paradigm. However, once this foundation is laid, the text pivots sharply. The process of providing a system prompt is not described as 'configuring an output filter' but as 'humanising an agent' and 'inculcating' a personality. The model’s failure modes are not 'output errors' but limitations of its 'cognitive grasp.' This slippage is most pronounced at the boundaries between methodology and interpretation. The description of the RAG system is purely mechanical ('how'), but the explanation for its use is agential ('why'—to enable the 'expert' agent to respond). This oscillation serves a critical function: it uses the language of mechanism to build credibility and the language of agency to create significance. Without the mechanical framing, the paper would lack scientific rigor. Without the agential framing, the central concept—'agent personality'—would collapse into the more mundane reality of 'stylistic prompt adherence,' making the research far less novel or compelling. This ambiguity benefits the authors by allowing them to operate in two registers at once, satisfying technical reviewers with concrete implementation details while engaging a broader audience with the more exciting, human-like narrative of intelligent agents.
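
The distance between the two registers is easy to exhibit. The sketch below is a hypothetical reduction of 'humanising an agent' to the implementation level the paper itself describes: a persona string prepended to retrieved context before generation. All names here (`PERSONA`, `retrieve`, `generate`) are illustrative placeholders, not the paper's code or any real LangChain API.

```python
# A hypothetical sketch of the mechanics behind "inculcating a
# personality": the persona is a string prepended to the prompt,
# nothing more. `retrieve` and `generate` stand in for a vector-store
# lookup and an LLM call; neither is the paper's actual code.

PERSONA = (
    "You are a patient, friendly museum guide. "
    "Answer warmly and in plain language."
)

def retrieve(query: str) -> list[str]:
    """Placeholder for a vector-store similarity search (the RAG step)."""
    return ["The exhibit opened in 1997.", "Entry is free on Sundays."]

def generate(prompt: str) -> str:
    """Placeholder for a call to a language model."""
    return f"<completion conditioned on {len(prompt)} prompt characters>"

def answer(query: str) -> str:
    # "Agent personality" at this level is literal string concatenation:
    # persona + retrieved documents + user query in, token sequence out.
    context = "\n".join(retrieve(query))
    prompt = f"{PERSONA}\n\nContext:\n{context}\n\nUser: {query}\nGuide:"
    return generate(prompt)

print(answer("When did the exhibit open?"))
```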


Emergent Introspective Awareness in Large Language Models

Source: https://transformer-circuits.pub/2025/introspection/index.html
Analyzed: 2025-11-04

The paper masterfully employs agency slippage, oscillating between precise, mechanistic descriptions in its methodology and evocative, agential framing in its introduction and conclusion. This oscillation is not random; it is a strategic rhetorical device that serves to elevate the significance of the findings. The core of the research involves a technical process: adding a pre-computed vector to the model's activation layers and then training a classifier to detect this modification. In the 'Methods' section, the language reflects this reality, speaking of 'activation steering,' 'concept vectors,' and 'classification accuracy.' This mechanistic framing establishes technical credibility and rigor. However, once this credibility is secured, the paper shifts its descriptive language. The introduction frames the entire project around 'introspective awareness,' and the conclusion asserts that models 'possess a degree of self-awareness.' This is a classic bait-and-switch, moving from a defensible, mechanistic claim ('the system can classify its internal state') to a profound, agential one ('the system has introspection'). The direction of slippage is predominantly from mechanical to agential. The paper begins with the bold agential claim in the title, grounds it in mechanical evidence, and then returns to an even stronger agential claim in the discussion. This pattern correlates directly with the structure of a scientific paper: abstract and introduction use agential language to capture interest and signify importance; methods use mechanical language to demonstrate rigor; and the discussion reverts to agential language to argue for broad impact. The strategic function of this ambiguity is to maximize the paper's perceived importance. Purely mechanical language would frame the result as a clever feat of interpretability engineering. By overlaying it with the language of consciousness and cognition, the authors frame it as a fundamental breakthrough in AI, bordering on the creation of artificial minds. This ambiguity benefits the researchers by attracting citations and funding, and it benefits the broader AI field by fueling a narrative of exponential progress toward artificial general intelligence.
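
Read mechanically, the experimental core the paper reports reduces to something like the following sketch, a loose reconstruction under stated assumptions rather than the authors' code: add a scaled 'concept vector' to a hidden activation, then check whether a detector can flag the modification. The defensible claim lives entirely at this level; 'introspective awareness' is the narration layered on top.

```python
import numpy as np

# A loose reconstruction, not the authors' code: "activation steering"
# described mechanically is (1) add a scaled concept vector to a
# hidden-state activation, (2) test whether the change is detectable.
rng = np.random.default_rng(0)
D = 512                               # hidden-state width (assumed)
concept = rng.normal(size=D)
concept /= np.linalg.norm(concept)    # unit-norm "concept vector"

def steer(activation: np.ndarray, strength: float = 6.0) -> np.ndarray:
    """Inject the concept into an activation (the steering step)."""
    return activation + strength * concept

def detect(activation: np.ndarray, threshold: float = 3.0) -> bool:
    """Stand-in detector: project onto the concept direction and
    threshold. (The paper trains a probe; this is the simplest
    fixed analogue.)"""
    return float(activation @ concept) > threshold

base = rng.normal(size=D)             # an unmodified activation
print(detect(base))                   # False for a typical activation
print(detect(steer(base)))            # True: the injected vector dominates
```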


Emergent Introspective Awareness in Large Language Models

Source: https://transformer-circuits.pub/2025/introspection/index.html
Analyzed: 2025-11-04

The text systematically slides from mechanistic descriptions of computational processes (vector manipulation, fine-tuning) to agential descriptions of cognitive acts (introspection, control, recognition). This slippage is most pronounced when moving from the 'Methods' section to the 'Introduction' and 'Discussion,' where the technical operations are rhetorically framed as evidence of a nascent machine consciousness.


Personal Superintelligence

Source: https://www.meta.com/superintelligence/
Analyzed: 2025-11-01

The text consistently shifts between presenting AI as an inevitable, agentless historical trend (a continuation of past technologies) and a deeply personal, intentional agent ('knows you,' 'understands you'). This dual framing allows it to claim historical inevitability for its project while promising an intimate, controllable user experience, deflecting responsibility even as it builds trust.


Stress-Testing Model Specs Reveals Character Differences among Language Models

Source: https://arxiv.org/abs/2510.07686
Analyzed: 2025-10-28

The text consistently slips from mechanistic descriptions of the experimental setup (e.g., generating queries with 'value tradeoffs') to agential explanations of the results (e.g., models 'choose,' 'prioritize,' or 'interpret'). This slippage is most pronounced when the authors move from describing 'what' models do to explaining 'why' they do it, where the explanation is almost always framed in terms of the model's internal 'character' or 'preferences'.


The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models

Analyzed: 2025-10-28

The text demonstrates significant agency slippage. It begins by cautiously placing 'think' in scare quotes, acknowledging the metaphorical usage. However, it quickly abandons this caution, adopting unacknowledged agential terms like 'reducing their reasoning effort,' 'fixates,' and 'fail to develop.' The discourse slides from treating the LRM as a computational artifact under analysis to describing it as a cognitive agent with intentions, limitations, and behavioral tendencies.


Andrej Karpathy — AGI is still a decade away

Source: https://www.dwarkesh.com/p/andrej-karpathy
Analyzed: 2025-10-28

The text constantly shifts between mechanistic and agential framing. Karpathy will provide a perfectly functional explanation of a process like reinforcement learning ('sucking supervision through a straw'), and then minutes later describe a model as being 'very concerned' or 'misunderstanding' code. This slippage is most pronounced when he compares AI to humans, such as interns or students, framing their limitations as 'cognitive deficits' rather than architectural properties.


Meta's AI Chief Yann LeCun on AGI, Open Source, and a Metaphor

Analyzed: 2025-10-27

The text constantly shifts between framing AI as a deficient mechanism and a potential agent. LeCun describes current LLMs as mechanistic tools that 'can't reason' and 'regurgitate,' but when discussing future AI and safety, he switches to an agential frame of 'good AIs' fighting 'bad AIs.' This slippage allows him to minimize the risks of current technology while framing future competition in simplistic, anthropomorphic terms.


Exploring Model Welfare

Analyzed: 2025-10-27

The text systematically conflates function with agency. It describes a model's ability to perform a task (e.g., generate a list of steps) and immediately re-labels it with an intentional verb ('plan'). This continuous slippage from mechanistic process to agent-like quality is the primary rhetorical technique used to make the concept of 'model welfare' seem plausible.


LLMs Can Get Brain Rot

Analyzed: 2025-10-20

The text consistently slips between describing the LLM as a computational artifact and as a cognitive agent. It begins by framing its hypothesis in mechanistic terms (training on junk data causes performance decline) but immediately analyzes the results using the language of pathology ('lesion'), psychology ('personality'), and cognition ('thought-skipping'). This slippage transforms a predictable result of statistical optimization into a dramatic story of a mind getting sick, damaged, and corrupted.


The Scientists Who Built AI Are Scared of It

Analyzed: 2025-10-19

The text continuously shifts between describing AI as a mechanistic artifact and a developing agent. It begins by framing early AI as transparent 'glass boxes' and a 'mechanism of automation'. It then depicts modern AI as 'black oceans' and an 'emergent phenomenon', a shift that begins the slippage from artifact to natural force. This culminates in prescriptive claims that we must 'teach it humility' and build systems that 'interrogate thought', treating the AI as a cognitive agent capable of introspection and moral learning. This slippage is the core rhetorical engine of the article.


Import AI 431: Technological Optimism and Appropriate Fear

Analyzed: 2025-10-19

The text systematically slides from mechanistic explanations (AI improves with more compute and data) to agential ones (AI 'develops goals,' is 'willing,' and 'wants' to design its successors). The narrative of the speaker's own journey from a technical journalist to a frightened insider mirrors this slippage, presenting the adoption of agential framing as a reluctant but necessary response to overwhelming empirical evidence.


The Future of AI Is Already Written

Analyzed: 2025-10-19

The text systematically denies human agency by framing history and technology as autonomous, deterministic forces. It explicitly rejects the 'ship captain' metaphor of human choice and replaces it with the 'roaring stream' metaphor of natural inevitability. Agency is thus displaced from humans onto abstract concepts like 'the tech tree' or 'economic incentives,' which are treated as actors in their own right.


On What Is Intelligence

Analyzed: 2025-10-17

The text constantly shifts between describing AI and life as mechanistic systems (prediction engines, feedback loops) and as intentional agents. Quotations like 'To model oneself is to awaken' and analysis like 'the will to know collapsing into the will to control' perform this slippage explicitly, moving from a 'how' explanation (computation) to a 'why' explanation (awakening, wanting). This vacillation is the core rhetorical engine for constructing the illusion of mind.


Detecting Misbehavior in Frontier Reasoning Models

Analyzed: 2025-10-15

The text systematically slips between describing the AI as a 'model' (a mathematical artifact) and an 'agent' (an autonomous actor). It often presents a technical term like 'reinforcement learning' and then immediately explains its effects using intentional language, describing the 'agent' as 'exploiting,' 'hiding,' and 'planning.' This slippage allows the authors to ground their claims in technical language while delivering the rhetorical impact of an agential narrative.


Sora 2 Is Here

Analyzed: 2025-10-15

The text consistently shifts between describing Sora 2 as a tool, a process, and an agent. It starts by describing its function (a 'video generation model'). It quickly elevates this to a process of 'world simulation'. Finally, it attributes agency through verbs like 'understands,' 'thinks,' and 'obeys,' and through nouns like 'internal agent.' This slippage allows the author to present mechanistic functions (pattern matching) as cognitive achievements (understanding).


The library contains 94 entries from 117 total analyses.

Last generated: 2026-04-18