Saturday, April 12, 2025

A New AI Approach to Rapidly Compress Large Language Models without a Significant Loss of Quality

Good news! Amazing how a Russian company, MIT, Austria's ISTA, and the Saudi KAUST can do research together in these times. 😊

"
  • HIGGS, an innovative method for compressing large language models, was developed in collaboration with teams at Yandex Research, MIT, KAUST, and ISTA.
  • HIGGS makes it possible to compress LLMs without additional data or resource-intensive parameter optimization.
  • Unlike other compression methods, HIGGS does not require specialized hardware or powerful GPUs. Models can be quantized directly on a smartphone or laptop in just a few minutes with no significant quality loss.
  • The method has already been used to quantize popular LLaMA 3.1 and 3.2-family models, as well as DeepSeek and Qwen-family models.
The Yandex Research team (a Russian company), together with researchers from the Massachusetts Institute of Technology (MIT), the Institute of Science and Technology Austria (ISTA), and the King Abdullah University of Science and Technology (KAUST), developed a method to rapidly compress large language models without a significant loss of quality. ..."
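The claim that models can be quantized on a laptop in minutes makes sense because the method is data-free: there is no calibration set and no gradient-based optimization, just a transform of the weights followed by rounding. Here is a minimal sketch of that idea in Python, assuming the usual recipe of a Hadamard rotation (which makes weight groups look roughly Gaussian) followed by rounding to a small grid; the group size, grid construction, and function names are my own illustrative choices, not the authors' implementation.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Normalized Hadamard matrix of size n (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def nearest_grid(x: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """Round each value to its nearest grid point (round-to-nearest)."""
    idx = np.abs(x[:, None] - grid[None, :]).argmin(axis=1)
    return grid[idx]

def higgs_like_quantize(W: np.ndarray, group: int = 64, bits: int = 3) -> np.ndarray:
    """Rotate each weight group with a Hadamard matrix, round to a small
    Gaussian grid, and rotate back. No calibration data, no gradients."""
    H = hadamard(group)
    levels = 2 ** bits
    # Stand-in for an MSE-optimal Gaussian grid: quantiles of N(0, 1).
    samples = np.random.default_rng(0).standard_normal(1_000_000)
    grid = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    Wq = np.empty_like(W, dtype=np.float64)
    for row in range(W.shape[0]):
        for start in range(0, W.shape[1], group):      # assumes width % group == 0
            block = W[row, start:start + group]
            scale = block.std() + 1e-8
            rotated = H @ (block / scale)              # near-Gaussian coordinates
            rounded = nearest_grid(rotated, grid)      # 2**bits levels per value
            Wq[row, start:start + group] = scale * (H.T @ rounded)
    return Wq

# Example: quantize a random 128x128 weight matrix to ~3 bits per parameter.
W = np.random.default_rng(1).standard_normal((128, 128))
Wq = higgs_like_quantize(W, group=64, bits=3)
print("relative L2 error:", np.linalg.norm(W - Wq) / np.linalg.norm(W))
```

The whole pipeline is a matrix multiply and a rounding step per weight group, which is why this kind of quantization runs comfortably on a laptop CPU.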

From the abstract of the related AQLM paper:
"The emergence of accurate open large language models (LLMs) has led to a race towards performant quantization techniques which can enable their execution on end-user devices.
In this paper, we revisit the problem of "extreme" LLM compression, defined as targeting extremely low bit counts such as 2 to 3 bits per parameter, from the point of view of classic methods in Multi-Codebook Quantization (MCQ).
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval to advance the state-of-the-art in LLM compression, via two innovations:
1) learned additive quantization of weight matrices in input-adaptive fashion, and
2) joint optimization of codebook parameters across each transformer block.
Broadly, AQLM is the first scheme that is Pareto optimal in terms of accuracy vs. model size when compressing to less than 3 bits per parameter, and significantly improves upon all known schemes in the extreme compression (2-bit) regime.
In addition, AQLM is practical: we provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed, while executing in a much smaller memory footprint."
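To make the bit budget concrete: with additive multi-codebook quantization, a group of d weights is stored as one index per codebook and reconstructed as the sum of the chosen codewords, so M codebooks of K entries cost M × log2(K) bits per group (e.g., two 256-entry codebooks over groups of 8 weights is 16 bits per 8 weights, i.e., 2 bits per parameter). The sketch below shows only that representation, with random stand-in codebooks and greedy nearest-codeword encoding; AQLM's learned, input-adaptive codebooks and the joint optimization across a transformer block are omitted, and all names and sizes are illustrative.

```python
import numpy as np

d, M, K = 8, 2, 256   # group size, number of codebooks, codewords per codebook
# Storage per group: M * log2(K) = 16 bits for 8 weights -> 2 bits per weight.

rng = np.random.default_rng(0)
codebooks = rng.standard_normal((M, K, d)) * 0.1   # random stand-in for learned codebooks

def encode_group(w: np.ndarray) -> np.ndarray:
    """Greedy residual encoding: pick the nearest codeword from each codebook."""
    residual = w.astype(np.float64)
    idx = np.empty(M, dtype=np.uint8)
    for m in range(M):
        dists = np.linalg.norm(codebooks[m] - residual, axis=1)
        idx[m] = dists.argmin()
        residual = residual - codebooks[m, idx[m]]
    return idx

def decode_group(idx: np.ndarray) -> np.ndarray:
    """Reconstruct the group as the sum of its selected codewords."""
    return sum(codebooks[m, idx[m]] for m in range(M))

w = rng.standard_normal(d) * 0.1
codes = encode_group(w)          # two uint8 indices, i.e. 16 bits of storage
w_hat = decode_group(codes)
print("codes:", codes, "reconstruction error:", np.linalg.norm(w - w_hat))
```

Decoding is just table lookups and additions, which is also why fast GPU and CPU kernels for token generation are feasible at these bit widths.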

LLMs No Longer Require Powerful Servers: Researchers from MIT, KAUST, ISTA, and Yandex Introduce a New AI Approach to Rapidly Compress Large Language Models without a Significant Loss of Quality - MarkTechPost





