Friday, July 03, 2026

Anatomy of Massive Activations and Attention Sinks

This could be an interesting paper by Yann LeCun and his team!

From the abstract:
"We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationships remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence and show that ablating it causes the two phenomena to decouple."
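
To make the two definitions in the abstract concrete, here is a rough sketch on toy data. It is not the paper's method: the function names, the "100x the median magnitude" cutoff for a massive activation, and the "2x the uniform share" cutoff for a sink are all my own illustrative assumptions, and the numbers are synthetic.

```python
from statistics import median

def massive_activations(hidden, ratio=100.0):
    """Return (token, channel) pairs whose magnitude exceeds
    `ratio` times the median magnitude over all entries.
    The ratio is an illustrative assumption, not a value from the paper."""
    mags = [abs(v) for row in hidden for v in row]
    med = median(mags)
    return [(t, c)
            for t, row in enumerate(hidden)
            for c, v in enumerate(row)
            if abs(v) > ratio * med]

def attention_mass(attn):
    """Average attention weight each key token receives across all query rows.
    `attn[q][k]` is the weight query q puts on key k; each row sums to 1."""
    n_queries = len(attn)
    n_keys = len(attn[0])
    return [sum(row[k] for row in attn) / n_queries for k in range(n_keys)]

def sink_tokens(attn, factor=2.0):
    """Return key indices that receive more than `factor` times the uniform
    share (1 / number of keys). The factor is an illustrative assumption."""
    mass = attention_mass(attn)
    uniform = 1.0 / len(mass)
    return [k for k, m in enumerate(mass) if m > factor * uniform]

# Synthetic hidden states: 3 tokens x 4 channels, with one extreme outlier
# at token 0, channel 2.
hidden = [
    [0.5, -0.3, 250.0, 0.1],
    [0.2, 0.4, -0.6, 0.3],
    [-0.1, 0.7, 0.5, -0.2],
]

# Synthetic causal attention for one head: every query leans on token 0.
attn = [
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
    [0.7, 0.1, 0.2],
]

print(massive_activations(hidden))  # [(0, 2)]
print(sink_tokens(attn))            # [0]
```

In this toy setup the same token (index 0) is both the massive-activation token and the attention sink, which mirrors the co-occurrence the abstract describes. On a real model you would read the hidden states and attention weights from a forward pass and pick cutoffs that suit the model's scale.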

Anatomy of Massive Activations and Attention Sinks | OpenReview (open access)
