How much do you really want to learn about the secret sauce? 😊😇
Unfortunately, Anthropic did not publish its two new papers through the usual scientific peer-review process, but self-published them. The first paper is very long, like a book chapter or even longer, with many illustrations.
"... Knowing how models like Claude think would allow us to have a better understanding of their abilities, as well as help us ensure that they’re doing what we intend them to. For example:
- Claude can speak dozens of languages. What language, if any, is it using "in its head"?
- Claude writes text one word at a time. Is it only focusing on predicting the next word or does it ever plan ahead?
- Claude can write out its reasoning step-by-step. Does this explanation represent the actual steps it took to get to an answer, or is it sometimes fabricating a plausible argument for a foregone conclusion?
...
Today, we're sharing two new papers that represent progress on the development of the "microscope", and the application of it to see new "AI biology".
In the first paper, we extend our prior work locating interpretable concepts ("features") inside a model to link those concepts together into computational "circuits", revealing parts of the pathway that transforms the words that go into Claude into the words that come out.
In the second [paper], we look inside Claude 3.5 Haiku, performing deep studies of simple tasks representative of ten crucial model behaviors, including the three described above. Our method sheds light on a part of what happens when Claude responds to these prompts, which is enough to see solid evidence that:
- Claude sometimes thinks in a conceptual space that is shared between languages, suggesting it has a kind of universal “language of thought.” We show this by translating simple sentences into multiple languages and tracing the overlap in how Claude processes them.
- Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes the next line to get there. This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so.
- Claude, on occasion, will give a plausible-sounding argument designed to agree with the user rather than to follow logical steps. We show this by asking it for help on a hard math problem while giving it an incorrect hint. We are able to “catch it in the act” as it makes up its fake reasoning, providing a proof of concept that our tools can be useful for flagging concerning mechanisms in models.
... "
Shared features exist across English, French, and Chinese, indicating a degree of conceptual universality.
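To make the idea of "shared features" concrete, here is a minimal toy sketch in Python. It is not Anthropic's actual method (they trace learned dictionary features through real model circuits); it only illustrates the kind of overlap measurement the paper describes. The activation vectors, the 0.1 threshold, and the split into shared versus language-specific features are all fabricated assumptions for illustration.

```python
import numpy as np

def active_features(activations, threshold=0.1):
    """Return the indices of features whose activation magnitude exceeds a threshold."""
    return set(np.flatnonzero(np.abs(activations) > threshold))

def feature_overlap(a, b, threshold=0.1):
    """Jaccard overlap between the active-feature sets of two activation vectors."""
    fa = active_features(a, threshold)
    fb = active_features(b, threshold)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)

# Toy activation vectors standing in for a model's internal features when it
# processes the same sentence in three languages (fabricated numbers): a block
# of shared "concept" features plus a few language-specific ones.
rng = np.random.default_rng(0)
shared = rng.normal(size=16)                      # concept features shared across languages
english = np.concatenate([shared, rng.normal(size=4)])
french  = np.concatenate([shared, rng.normal(size=4)])
chinese = np.concatenate([shared, rng.normal(size=4)])

print(round(feature_overlap(english, french), 2))
print(round(feature_overlap(english, chinese), 2))
```

In this toy setup the overlap scores come out high because most active features sit in the shared block; a real experiment would compare features extracted from the model's own activations on translated sentences.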