The Yandex Research AI lab, in collaboration with researchers from institutions including HSE University, MIT, KAUST, and ISTA, has developed a method for quickly compressing large language models (LLMs) without losing quality. The developers emphasize that a smartphone or laptop is now sufficient to work with such models, eliminating the need for expensive servers and powerful GPUs.
The method enables rapid testing and deployment of new neural-network-based solutions, saving development time and money. This makes LLMs accessible not only to large companies but also to small businesses, non-profit laboratories and institutions, individual developers, and researchers.
Yandex explained:
Previously, running a language model on a smartphone or laptop required quantizing it on an expensive server, a process that took anywhere from several hours to several weeks. Now, quantization can be performed directly on a phone or laptop in a matter of minutes.
The new quantization method is called HIGGS, short for Hadamard Incoherence with Gaussian MSE-optimal GridS. It is already available to developers and researchers on Hugging Face and GitHub.
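The name spells out the method's two ingredients: a Hadamard rotation that makes groups of weights "incoherent" (approximately i.i.d. Gaussian), and a quantization grid chosen to minimize mean-squared error under a Gaussian distribution, which is why no calibration data is needed. Below is a minimal NumPy sketch of those two ideas. It is an illustration, not the authors' implementation: the group size, scaling scheme, and Lloyd-Max grid construction are assumptions made for the example.

```python
import numpy as np
from scipy.linalg import hadamard  # Sylvester Hadamard matrices (size = power of two)

def mse_optimal_gaussian_grid(bits: int, iters: int = 30) -> np.ndarray:
    """Lloyd-Max grid for N(0, 1): alternately assign samples to their nearest
    grid point and move each point to the mean of its cell (illustrative)."""
    samples = np.random.default_rng(0).standard_normal(100_000)
    grid = np.quantile(samples, (np.arange(2**bits) + 0.5) / 2**bits)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - grid[None, :]).argmin(axis=1)
        for k in range(2**bits):
            cell = samples[idx == k]
            if cell.size:
                grid[k] = cell.mean()
        grid.sort()
    return grid

def quantize_group(w: np.ndarray, grid: np.ndarray, rng) -> np.ndarray:
    """Randomized Hadamard rotation -> snap to the Gaussian-optimal grid ->
    rotate back. The rotation is orthogonal, so its transpose inverts it."""
    n = len(w)                           # group size must be a power of two here
    s = rng.choice([-1.0, 1.0], size=n)  # random signs randomize the rotation
    H = hadamard(n) / np.sqrt(n)         # orthonormal Hadamard matrix
    z = H @ (s * w)                      # rotated weights look roughly Gaussian
    scale = z.std() + 1e-12
    q = grid[np.abs(z[:, None] / scale - grid[None, :]).argmin(axis=1)]
    return s * (H.T @ (q * scale))       # undo the rotation and the sign flips

grid = mse_optimal_gaussian_grid(bits=4)      # 16 levels, the same budget as NF4
rng = np.random.default_rng(1)
w = rng.standard_normal(256)                  # stand-in for one group of weights
w_hat = quantize_group(w, grid, rng)
print("relative MSE:", np.mean((w - w_hat) ** 2) / np.mean(w ** 2))
```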
HIGGS reduces a model's size while preserving its quality, making it possible to run the model on more accessible devices. For example, the method can compress even very large models such as DeepSeek-R1, with 671 billion parameters, and Llama 4 Maverick, with 400 billion parameters, which until now could only be quantized with the simplest methods at a significant cost in quality.
The method has already been tested on the popular Llama 3 and Qwen2.5 model families. Experiments show that HIGGS delivers the best quality-to-size trade-off among existing data-free quantization methods, including NF4 (4-bit NormalFloat) and HQQ (Half-Quadratic Quantization).
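Since HIGGS is distributed through Hugging Face, trying it out could look like the sketch below, assuming the transformers integration follows the library's usual quantization-config pattern (as with BitsAndBytesConfig). The HiggsConfig class name and its arguments are assumptions here and should be verified against the project's Hugging Face and GitHub pages.

```python
# Hedged usage sketch: assumes the transformers integration exposes a
# HiggsConfig analogous to other quantization configs. Verify the class
# name and arguments against the project's Hugging Face/GitHub pages.
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"   # one of the tested model families
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantization happens at load time on the local machine: no calibration
# data and no separate server-side quantization pass.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=HiggsConfig(bits=4),       # assumed argument name
    device_map="auto",
)

prompt = "Explain quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```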