The Statistical Correlation of Affective Vocabulary Vectors on Subsequent Token Prediction

Or why the urge to tell a psychological story about the math is (almost) irresistible.


Originally posted on Substack.

The title is not very gripping. It doesn’t quite catch the storyable novelty that characterizes LLMs. But it is what the title could have been. Instead, they called it: Emotion Concepts and their Function in a Large Language Model.

First, no shade thrown here: Anthropic is doing some incredible mechanistic interpretability work. They are literally finding the mathematical shape of the word “desperate” inside the neural network. I find this super fascinating. But instead of keeping the description mechanical, they wrap it in the language of a psychological thriller. And notice how they use the phrase “Claude’s preferences.” An optimization function does not and cannot have a “preference.” Full stop.

Here’s what the article says, and it is very cool for sure. They are looking inside the black box (or rather, the glass box) and finding that words like “cry,” “tears,” “funeral,” and “sad” all activate a specific cluster of weights. When that cluster is activated, the model is statistically more likely to output pessimistic or somber text. That is mechanistically fascinating. One could even say that it is a triumph of linear algebra that we can map the topography of human language so precisely. But there’s some cognitive smuggling going on in the word “concept.” A concept implies understanding. When a human has a concept of sadness, it is tied to their physical body, their memory, and their social and psychological reality. However, when a language model has a “vector” for sadness, it is entirely syntactic. (And that’s not trivial!) It is the shape of the word, devoid of the substance.
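To make the mechanics concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the three-dimensional vectors are invented numbers, the names (ACTIVATIONS, OUTPUT_WEIGHTS, concept_direction, somber_probability) are hypothetical, and the difference-of-means recipe is one common way to build such a direction, not necessarily what the Anthropic article does.

```python
import math

# Toy 3-dimensional "activations". Every number here is invented for
# illustration; none of it comes from the article or from a real model.
ACTIVATIONS = {
    "cry": [0.9, 0.1, 0.2],
    "tears": [0.8, 0.2, 0.1],
    "funeral": [0.7, 0.0, 0.3],
    "sad": [1.0, 0.1, 0.0],
    "table": [0.1, 0.9, 0.2],
    "invoice": [0.0, 0.8, 0.4],
    "gravel": [0.2, 0.7, 0.1],
    "seven": [0.1, 0.6, 0.3],
}
SAD_WORDS = ["cry", "tears", "funeral", "sad"]
NEUTRAL_WORDS = ["table", "invoice", "gravel", "seven"]

# Hypothetical output weights for two candidate next tokens.
OUTPUT_WEIGHTS = {"somber": [1.0, 0.0, 0.0], "cheerful": [0.0, 1.0, 0.0]}


def mean(vectors):
    """Component-wise average of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def concept_direction(positive, negative):
    """Unit vector pointing from the neutral mean toward the 'sad' mean."""
    pos = mean([ACTIVATIONS[w] for w in positive])
    neg = mean([ACTIVATIONS[w] for w in negative])
    diff = [p - n for p, n in zip(pos, neg)]
    norm = math.sqrt(dot(diff, diff))
    return [d / norm for d in diff]


def somber_probability(hidden, direction, strength):
    """Push the hidden state along the direction, then softmax over two tokens."""
    steered = [h + strength * d for h, d in zip(hidden, direction)]
    logits = {tok: dot(steered, w) for tok, w in OUTPUT_WEIGHTS.items()}
    total = sum(math.exp(v) for v in logits.values())
    return math.exp(logits["somber"]) / total


direction = concept_direction(SAD_WORDS, NEUTRAL_WORDS)
hidden = [0.3, 0.3, 0.3]  # an arbitrary starting state with no lean either way

# The probability of the "somber" token rises as the direction is dialed up.
for strength in (0.0, 1.0, 2.0):
    print(strength, round(somber_probability(hidden, direction, strength), 3))
```

In this toy, the “sad” words project more strongly onto the direction than the neutral ones, and adding more of the direction raises the probability of the “somber” token. It is averages, dot products, and a softmax all the way down.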

Yet it is a fascinating cascade of metaphor and subsequent narrative transportation that turns a statistical output into some sort of premeditated crime.

Head on over to Discourse Depot to see a Metaphor, Anthropomorphism and Explanation Audit of the article (with a new and improved schema version for the output). I’ve also run it through the “What Survives When the Metaphor is Removed” machine.