The method enables rapid testing and deployment of new neural-network-based solutions, saving development time and money. This makes LLMs accessible not only to large companies but also to small businesses, non-profit labs and institutions, and individual developers and researchers.
Yandex explained:
Previously, to run a language model on a smartphone or laptop, it had to be quantized on an expensive server, a process that took anywhere from several hours to several weeks. Now quantization can be performed directly on a phone or laptop in a matter of minutes.
The new quantization method is called HIGGS (Hadamard Incoherence with Gaussian MSE-optimal GridS). It is already available to developers and researchers on Hugging Face and GitHub.
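For readers who want to try it, below is a minimal sketch of loading a model with HIGGS quantization through the Hugging Face transformers integration. The HiggsConfig class and the bits parameter follow the transformers quantization documentation, but treat the exact names and requirements (such as the companion FLUTE kernel package) as assumptions to verify against the project's repository; the model name is chosen purely for illustration.

```python
# Sketch: quantizing a model with HIGGS via the transformers integration.
# Assumes HiggsConfig is available in your transformers version and that
# the FLUTE inference kernels are installed; check the HIGGS repo for
# the current API before relying on this.
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig

model_id = "meta-llama/Meta-Llama-3-8B"  # example model, not prescribed by the article

# Data-free quantization: no calibration dataset is required.
quant_config = HiggsConfig(bits=4)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,  # weights are quantized on the fly at load time
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

Because the method is data-free, quantization happens at load time with no calibration pass, which is what makes on-device compression in minutes plausible.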
HIGGS reduces a model's size while preserving its quality, allowing it to run on more accessible hardware. For example, the method can compress even models as large as DeepSeek-R1 (671 billion parameters) and Llama 4 Maverick (400 billion parameters), which until now could be quantized only with the simplest methods, at a significant cost in quality.
The method has already been tested on the popular Llama 3 and Qwen2.5 model families. In those experiments, HIGGS achieved the best quality-to-size trade-off among existing data-free quantization methods, including NF4 (4-bit NormalFloat) and HQQ (Half-Quadratic Quantization).
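For context, the NF4 baseline mentioned above is also data-free and is available through the standard bitsandbytes integration in transformers, so a side-by-side comparison is straightforward to set up. A minimal sketch, with the model name again chosen only for illustration:

```python
# Sketch: loading the same model with the NF4 baseline via bitsandbytes.
# BitsAndBytesConfig is a standard transformers API; comparing the
# resulting model's perplexity or benchmark scores against the HIGGS
# version reproduces the kind of quality-vs-size comparison described above.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # 4-bit NormalFloat, also calibration-free
)

nf4_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # example model, not prescribed by the article
    quantization_config=nf4_config,
    device_map="auto",
)
```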