Common Sense: Anatomy of Massive Activations and Attention Sinks

Friday, July 03, 2026

Anatomy of Massive Activations and Attention Sinks

This could be an interesting paper by Yann LeCun and his team!

From the abstract:

"We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationships remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence and show that ablating it causes the two phenomena to decouple."
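
To make the two definitions in the abstract concrete, here is a minimal sketch of how one might flag each phenomenon on toy data. It is not the authors' method: the function names, the `ratio` and `share` thresholds, and the toy numbers are all assumptions chosen for illustration.

```python
from statistics import median

def massive_activations(hidden, ratio=100.0):
    """Return (token, channel) pairs whose magnitude exceeds
    `ratio` times the median magnitude over all entries.
    `ratio` is an illustrative threshold, not a value from the paper."""
    med = median(abs(v) for row in hidden for v in row)
    return [
        (t, c)
        for t, row in enumerate(hidden)
        for c, v in enumerate(row)
        if abs(v) > ratio * med
    ]

def attention_sinks(attn, share=0.5):
    """Return key positions whose attention mass, averaged over all
    query rows, exceeds `share`. Each row of `attn` is one query's
    attention distribution over keys. `share` is an assumed cutoff."""
    n_queries = len(attn)
    n_keys = len(attn[0])
    mean_mass = [
        sum(attn[q][k] for q in range(n_queries)) / n_queries
        for k in range(n_keys)
    ]
    return [k for k, m in enumerate(mean_mass) if m > share]

# Toy hidden states: 3 tokens x 4 channels, with one extreme outlier
# at token 0, channel 2 (hypothetical numbers).
hidden = [
    [0.5, -1.0, 800.0, 0.2],
    [0.3, 0.9, -0.4, 1.1],
    [-0.7, 0.6, 0.8, -0.2],
]

# Toy causal attention map: every query puts most of its mass on token 0.
attn = [
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
    [0.7, 0.1, 0.2],
]

outliers = massive_activations(hidden)
sinks = attention_sinks(attn)
print("massive activations at (token, channel):", outliers)
print("attention sink tokens:", sinks)
```

In this toy case the same token shows up in both lists, which is the co-occurrence the abstract says prior work observed. The two checks are still independent measurements, one on hidden states and one on attention weights.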

Anatomy of Massive Activations and Attention Sinks | OpenReview (open access)
